CHAPTER I
INTRODUCTION 

Quality  surveillance  of  military  petroleum  products 
is  the  aggregate  of  measures  to  be  applied  to  determine  and 
maintain  their  quality.   Quality  surveillance  programs  are 
conducted  in  order  that  required  petroleum  products  will  be 
available in a condition suitable for immediate use. Their
ultimate purposes are: (1) to ensure that no life is ever
lost or equipment damaged or destroyed through the use of
contaminated or deteriorated petroleum products, and (2) to
promote economy by minimizing the necessity of surveying or
reclaiming any petroleum products because of contamination
or deterioration.

The  success  of  any  such  program  is  dependent  upon 
several  factors,  not  the  least  of  which  is  the  maintenance 
of the highest standards of reliability in the testing
laboratories.

Importance  of  Laboratory  Reliability 

A  chemical  analysis  has  been  compared  to  an  elastic 
yardstick  never  giving  the  same  result  twice.    How  is  one 
to  know  then  if  laboratory  tests  are  "right"?   The  fact  is 
that  any  decision  regarding  a  petroleum  product  based  on 
laboratory  test  results  is  a  decision  under  uncertainty. 
Decisions under uncertainty always involve a risk of making
the wrong decision. This is of particular concern when test
results of a petroleum product border on acceptability
limits. In order to evaluate properly the risk of misclassifying
borderline material, it is important to know how much
stretch or shrinkage to allow in reported test results.
Common sense dictates that it is also important to reduce the
risk by reducing the elasticity of the yardstick as much as
is economically feasible. The economics of reliability
control are probably most apparent when considering a commercial
application for which the costs of reliability control
and the costs of wrong decisions can be computed quite accurately.
Consider a refinery laboratory where small deviations
could be expensive ones. When mixing a blend, a small
excess per sample unit of an expensive component could add
up to many dollars in excess costs in a continuous process.
J. T. Walter cites a report by one refinery of losses of one
million dollars per year on a single operation due to quality
give-away.² Conversely, a deficiency could cause rejection
of a product by a customer and add the costs of reprocessing
to the product.

In regard to military applications, consider the cost
of delay in discharging a tanker's cargo or defueling a ship
while additional samples are tested, if the first sample
results indicated that the quality was suspect. As a more
sobering example, we might visualize a heavily loaded aircraft
faltering on take-off and crashing because of loss of
power due to vapor lock. This could result from misclassification
of unfit fuel based on unreliable test results.³

From  these  examples,  the  importance  of  the  reliability 
of  laboratory  test  results  in  any  attempt  to  control  quality 
should  be  evident. 

A  Current  Effort  to  Control  Military  Laboratory  Reliability 

At the command level, the maintenance of the highest
standards of reliability in testing laboratories is dependent
upon the ability to detect apparent trends toward unreliability.
In pursuit of this goal, a correlation testing
program has been set up within a major military area command
as a part of its quality surveillance program. Identical
samples of aviation gasoline, motor gasoline, jet fuel,
diesel fuel and lubricating oil are prepared and distributed
tri-annually to each of ten participating laboratories. The
results are summarized and the average value of all observations
is determined for each test. Reproducibility limits
are then computed for those tests for which a method of
determining reproducibility limits is given in the applicable
American Society for Testing and Materials (ASTM) Standard.
Reproducibility limits can be computed for about seventy-five
per cent of the tests. The test results falling outside
these limits are indicated by an asterisk. A Summary
of Laboratory Performance is prepared which tabulates, by
activity, the number of tests reported for which reproducibility
limits are computed and the per cent within reproducibility
limits. Each summary includes the tabular data for
each of the two preceding series of tests as well as for the
current series.

Purpose  of  the  Thesis 

The  purpose  of  this  thesis  is  to  investigate  some 
statistical  methods  of  treating  the  data  obtained  through 
the  military  area  command  correlation  testing  program 
described  above  to  extract  more  definitive  information  from 
them concerning the reliability of the participating laboratories'
test results.


CHAPTER  II 


RELIABILITY 


This  chapter  discusses  types  of  measurement  error  and 
their  effects,  and  defines  the  associated  terminology  as  it 
will  be  used  throughout  the  following  chapters. 

Also defined are repeatability and reproducibility as
used by the American Society for Testing and Materials.

CAUSES  OF  UNRELIABILITY 

Scarborough points out that all measurements are subject
to three kinds of error: systematic or constant errors,
mistakes, and accidental errors.⁴

Systematic  Errors 

Systematic or constant errors are those which affect
all measurements alike. In regard to laboratory test results,
they could, for example, be due to improperly calibrated equipment
or to consistent but incorrect operative techniques.
Systematic errors are usually evident as a constant bias.

Mistakes 

Mistakes or blunders are due to carelessness, primarily
in making or recording observations. The fact that they do
not follow any law makes gross blunders recognizable as
isolated data points. Minor mistakes, however, may be difficult
to detect.


Accidental  Errors 

Accidental errors are those whose causes are unknown
or undetermined. They are usually small and they are considered
to follow the laws of chance. Consequently they are
also referred to as chance errors or random errors.

The mathematical theory of errors deals with accidental
errors only. Systematic errors and gross
blunders are due to assignable causes and can therefore be
eliminated, controlled, or accepted, as one chooses. Accidental
errors, however, cannot be avoided and are bound to occur with
a measurable probability.

COMPONENTS  OF  RELIABILITY 

Reliability,  precision,  and  accuracy  have  been  defined 
in  various  ways.   All  are  comparative  or  relative  terms 
rather  than  absolute  measures.   Arbitrary  scales  for  their 
measurement must be established based on predetermined standards.

Precision 

Precision is a quality of a set of data that describes
the degree of dispersion of the values. The lower the dispersion
or scatter, the higher the precision. Single measurements
cannot be considered to be "precise" or "not
precise."

Accuracy

Accuracy is a quality of a single measurement or a
series of measurements that expresses the degree to which
the single measurement (or the average of the set of measurements)
conforms to a predetermined "true" value. High
accuracy implies close agreement with the predetermined standard.

Target  Analogy 

The  relationship  between  precision  and  accuracy  is 
best  explained  through  use  of  the  target  analogy. 

Figure 2-1 illustrates four groupings of twelve shots
in a target. Target A illustrates a grouping which is precise
but not accurate. The shots are in a tight cluster but
considerably removed from the center of the target area.
This is analogous to the accompanying frequency histogram of
laboratory measurements in which the measurements are grouped
close together but their average value is considerably removed
from the true value of the property being measured.

Target B illustrates accuracy without precision. The
shots cluster around the center of the target in a random
fashion but are widely scattered. Likewise, in the accompanying
frequency histogram, measurements are relatively evenly
distributed around the true value but are relatively widely
dispersed.

FIGURE 2-1
TARGET ANALOGY: PRECISION AND ACCURACY

Target C illustrates a dispersion of shots which is
neither accurate nor precise. Again the shots are widely
scattered, and they do not form a uniformly dense pattern
around the center of the target as they did in B.

Target D illustrates good marksmanship, that is,
marksmanship that shows high precision (tight clustering)
and high accuracy (good centering).
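
The distinction drawn by the target analogy can be put in numerical terms. The following is a minimal sketch, assuming that accuracy is summarized by the bias of the average and precision by the standard deviation of the set; the function name and the readings are hypothetical and are not taken from the correlation testing program.

```python
from statistics import mean, pstdev

def describe_measurements(values, true_value):
    """Return (bias, spread) for a set of repeated measurements."""
    bias = mean(values) - true_value   # accuracy: a small |bias| means good centering
    spread = pstdev(values)            # precision: a small spread means tight clustering
    return bias, spread

# Hypothetical readings of a property whose accepted "true" value is 100.0
target_a = [104.9, 105.0, 105.1, 105.0]   # like Target A: precise but not accurate
target_b = [96.0, 104.0, 99.0, 101.0]     # like Target B: accurate but not precise

bias_a, spread_a = describe_measurements(target_a, 100.0)
bias_b, spread_b = describe_measurements(target_b, 100.0)
```

Set A shows a large bias with a small spread, and set B the reverse, which mirrors Targets A and B.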

Standards 

The ASTM Standards on Petroleum Products and Lubricants
provide convenient standards of precision in the form
of Repeatability and Reproducibility amounts given with the
description of the test method. Repeatability is defined
by ASTM as the greatest difference between two single
and independent results by a single operator in a given
laboratory that can be considered acceptable at the ninety-five
per cent confidence level. Reproducibility is defined
as the greatest difference between a single test result
obtained in one laboratory and a single test result obtained
in another laboratory that can be considered acceptable at
the ninety-five per cent confidence level.
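
Both definitions reduce to comparing the difference between two single results with a published amount. A minimal sketch of that comparison follows; the function name and the limit values of 0.3 and 0.7 are hypothetical and do not come from any particular ASTM method.

```python
def within_precision_limit(result_1, result_2, limit):
    """True if two single results differ by no more than the stated limit.

    Pass the method's repeatability amount for two results by one operator
    in one laboratory, or its reproducibility amount for results obtained
    in two different laboratories.
    """
    return abs(result_1 - result_2) <= limit

# Hypothetical amounts for an unnamed test method
REPEATABILITY = 0.3
REPRODUCIBILITY = 0.7

same_lab_ok = within_precision_limit(10.2, 10.6, REPEATABILITY)     # difference 0.4
two_labs_ok = within_precision_limit(10.2, 10.6, REPRODUCIBILITY)   # difference 0.4
```

With these assumed amounts, a difference of 0.4 would be questioned between two results from one operator but accepted between two laboratories.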


CHAPTER  III 
FUNDAMENTAL  STATISTICAL  MEASURES 

Introduction 

This chapter briefly discusses the fundamental statistical
measures which are applied or considered in later
chapters.

In  the  first  part  of  the  chapter  the  measures  are 
defined.   Methods  of  estimating  population  parameters  from 
sample  statistics  are  presented  in  the  next  section  followed 
by  a  comparison  of  the  relative  efficiency  of  the  various 
estimators.   Finally  a  discussion  is  given  of  some  of  the 
advantages  and  disadvantages  to  be  considered  when  choosing 
each  statistic  or  estimator. 

Frequent reference will be made to normal populations
or distributions of values. The theory of the normal distribution
stemmed from work done by Karl Gauss and, for this
reason, the normal distribution is sometimes identified as
the Gaussian distribution. The normal curve is defined
mathematically as

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right] \tag{3-1}$$

in which $\mu$ is the mean value of the variable and $\sigma$ is the
standard deviation, both of which are described in this
chapter.


In the context of equation (3-1), f(X) is known as a
"probability density function." For any probability density
function f(X), the probability that a value of X lies in the
interval $X_L \le X \le X_U$ is given by

$$P(X_L \le X \le X_U) = \int_{X_L}^{X_U} f(X)\,dX \tag{3-2}$$

Thus, the probability that a value X lies between limits $X_L$
and $X_U$ is equal to the area under the probability density
function f(X) between the two limits. This area is shown
in Figure 3-1.

FIGURE 3-1
A PROBABILITY DENSITY FUNCTION SHOWING THE AREA EQUIVALENT
TO THE PROBABILITY THAT X LIES BETWEEN $X_L$ AND $X_U$

A "normal distribution" for a variable such as X
signifies that the probability of X being between any two
limits $X_L$ and $X_U$ is given by equation (3-2) if one uses
equation (3-1) for the definition of f(X).
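
For the normal curve, the area of equation (3-2) can be evaluated without numerical integration by using the error function. The sketch below assumes that approach; the function names are illustrative only.

```python
import math

def normal_cdf(x, mu, sigma):
    """Area under the normal curve of equation (3-1) to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_probability(x_lower, x_upper, mu, sigma):
    """P(x_lower <= X <= x_upper) as in equation (3-2), for a normal f(X)."""
    return normal_cdf(x_upper, mu, sigma) - normal_cdf(x_lower, mu, sigma)

# Probability of falling within one standard deviation of the mean
p_one_sigma = normal_probability(-1.0, 1.0, mu=0.0, sigma=1.0)
```

The familiar result that about ninety-five per cent of a normal population lies within 1.96 standard deviations of the mean follows from the same function.
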
POPULATION PARAMETERS

Measures  of  Central  Tendency 

A universe or population is the totality of all pertinent
observations that might be made in a given problem.
If these observations are normally distributed, they will
be symmetrically dispersed around an "average" or central
value. The central tendency of the population is of fundamental
interest in any statistical analysis.

The ARITHMETIC MEAN or ARITHMETIC AVERAGE, $\mu$, of a set
of N values, $X_i$, is defined as the sum of the set of values
divided by the number of values in the set.

$$\text{POPULATION MEAN} = \mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{3-3}$$

The  arithmetic  mean  is  the  most  commonly  used  measure 
of  central  tendency  and  is  the  value  generally  intended  when 
the  term  "average"  or  "mean"  is  mentioned. 

The  MEDIAN  is  the  middle  value  of  a  set  of  numbers 
arranged  in  ascending  or  descending  order  according  to  value. 
For an even number of data points, it is the arithmetic average
of the two middle values.

50%  of  values  <  M  <  50%  of  values 

The MIDRANGE is a point halfway between the largest
and smallest observations. It is computed as the average of
the first and last values of a set, ordered according to
value.

$$\text{MIDRANGE} = \frac{X_1 + X_N}{2}, \quad \text{where } X_1 \le X_2 \le \dots \le X_N \tag{3-4}$$

For a normally distributed population, the arithmetic
mean, median, mode and midrange have the same value.
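
The three measures of central tendency defined above can be sketched as follows. The data set is hypothetical and deliberately skewed, so that the three values differ.

```python
from statistics import mean, median

def midrange(values):
    """Average of the smallest and largest values, as in equation (3-4)."""
    return (min(values) + max(values)) / 2

data = [2.0, 3.0, 5.0, 8.0, 12.0]   # hypothetical observations

central = {
    "mean": mean(data),          # equation (3-3)
    "median": median(data),      # middle value of the ordered set
    "midrange": midrange(data),
}
```

For a symmetrical set the three measures coincide; here they do not, which is one indication that the set is not symmetrically dispersed.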

Measures  of  Dispersion 

The  second  of  the  two  most  fundamental  measures  in 
statistical  analysis  is  dispersion.   Dispersion  is  a  measure 
of  the  extent  to  which  the  pertinent  observations  comprising 
the  population  are  scattered  around  a  measure  of  central 
tendency.   It  may  be  viewed  as  a  measure  of  precision  or  the 
consistency  of,  or  the  variation  in,  a  set  of  measurements. 

The RANGE is the simplest measure of general variability.
This is the difference between the highest and
lowest value of an entire set of measurements.

$$\text{RANGE} = w = X_N - X_1, \quad \text{where } X_1 \le X_2 \le \dots \le X_N \tag{3-5}$$

The AVERAGE DEVIATION is the arithmetic mean of the
absolute deviation of each value of a set of data from the
central value.

$$\text{AVERAGE DEVIATION} = \text{A.D.} = \frac{\sum_{i=1}^{N} |X_i - \mu|}{N} \tag{3-6}$$


The VARIANCE, or MEAN-SQUARE DEVIATION, is the average
of the squared deviations from the mean.

$$\text{VARIANCE} = \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \tag{3-7}$$

From a mathematical standpoint the variance is the
basic measure of the distribution, but a very frequently
used measure of dispersion is the STANDARD DEVIATION, or
ROOT-MEAN-SQUARE DEVIATION, which is the distance between
the mean and the point of inflection of a normal curve. The
standard deviation is defined as the positive square root of
the variance.

$$\text{STANDARD DEVIATION} = \sigma = \sqrt{\sigma^2} \tag{3-8}$$
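
The four measures of dispersion can be computed side by side. This is a minimal sketch that treats the data as a complete population (divisor N) and takes the mean as the central value for the average deviation; the data set is hypothetical.

```python
from statistics import mean, pvariance, pstdev

def value_range(values):
    """Range w of equation (3-5)."""
    return max(values) - min(values)

def average_deviation(values):
    """Mean absolute deviation from the mean, as in equation (3-6)."""
    mu = mean(values)
    return sum(abs(x - mu) for x in values) / len(values)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical population

dispersion = {
    "range": value_range(data),
    "average deviation": average_deviation(data),
    "variance": pvariance(data),         # equation (3-7), divisor N
    "standard deviation": pstdev(data),  # equation (3-8)
}
```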

ESTIMATING  POPULATION  PARAMETERS 

A statistical estimation problem involves selecting,
on the basis of sample information, an estimate which approximates
the value of a population parameter. Estimators are
used when practical considerations militate against direct
measurement of the population parameter. If the cost of
testing exceeds the value of the added benefits, it is
uneconomical to measure the parameter directly. If the population
is infinite, measurement of all samples is physically
impossible. If the test required to measure a particular
property alters, consumes or otherwise destroys the product,
measurement of all samples is not useful. These considerations
apply to testing of bulk petroleum products.

The problem of determining the "best" estimator
varies with the circumstances of the situation. In general,
the "best" estimator is one which has a distribution concentrated
near the true value of the parameter and which can
be applied economically.

Among the statistical criteria for evaluating estimators
are unbiasedness, consistency, and efficiency.

The  bias  of  an  estimator  is  the  difference  between  the 
mean  of  the  distribution  of  the  estimator  and  the  true  value 
of  the  parameter  being  estimated.   An  unbiased  estimator  then 
is  one  which  has  a  distribution  having  a  mean  value  exactly 
equal  to  that  of  the  parameter  being  estimated. 

An  estimator  is  consistent  if  the  probability  that  an 
estimate  will  vary  from  the  true  value  of  the  parameter  by 
more  than  any  given  amount  can  be  made  arbitrarily  small  by 
increasing  the  number  of  observations  in  the  sample.   More 
simply  stated,  an  estimator  is  said  to  be  consistent  if  the 
reliability  of  the  estimate  becomes  greater  as  the  sample 
size  is  increased. 

The  efficiency  of  an  estimator  is  a  relative  criterion 
which  will  be  discussed  in  a  later  section. 

Estimators  of  the  Population  Mean 

The sample mean, or arithmetic average, is an unbiased
estimator of the population mean for any type of population.
For a normally distributed population, the sample median and
the sample midrange are also unbiased estimators of the population
mean. The purpose of the estimator is to approximate
the value of a population parameter; however, the presence
of extreme values in a set of sample observations (particularly
a small set) could greatly distort the estimate. To
minimize distortion, various modifications of the mean,
median, and midrange may be computed. These modifications
are variously identified in the literature but the majority
follow two general patterns:

a. Outlying data in a set are excluded from computation
of the mean, median or midrange.

b. An equal number of values from the lower and
upper ends of an ordered set are excluded from
computation of the mean or midrange.

The  elimination  of  equal  numbers  of  values  from  both 
the  high  and  low  ends  of  the  ordered  set  will  not  of  course 
change  the  median.   It  should  also  be  obvious  that  the  median 
is  a  special  case  of  both  the  symmetrically  modified  mean  and 
the  symmetrically  modified  midrange.   Given  a  set  of  six 
values,  the  following  symmetrically  modified  means  may  be 
generated: 

$$\text{Exclude } X_1 \text{ and } X_6: \quad {}_{2}\bar{X}_{5} = \frac{X_2 + X_3 + X_4 + X_5}{4} \tag{3-9}$$

$$\text{Exclude } X_1, X_2, X_5, X_6: \quad {}_{3}\bar{X}_{4} = \frac{X_3 + X_4}{2} = \text{Median} \tag{3-10}$$

Again using a set of six values, the following symmetrically
modified midranges may be generated:

$$\text{Exclude } X_1 \text{ and } X_6: \quad {}_{2}C_{5} = \frac{X_2 + X_5}{2} \tag{3-11}$$

$$\text{Exclude } X_1, X_2, X_5, X_6: \quad {}_{3}C_{4} = \frac{X_3 + X_4}{2} = \text{Median} \tag{3-12}$$

General equation for computation of symmetrically modified
mean:

$${}_{(A+1)}\bar{X}_{(N-A)} = \frac{\sum_{i=A+1}^{N-A} X_i}{N - 2A} \tag{3-13}$$

Where: A = number of values to be eliminated from each
end of the ordered set.

General equation for computation of symmetrically modified
midrange:

$${}_{(A+1)}C_{(N-A)} = \frac{X_{(A+1)} + X_{(N-A)}}{2} \tag{3-14}$$
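
The general equations for the symmetrically modified mean and midrange translate directly into code. The sketch below assumes zero-based list indexing, so that X(A+1) is element `a` of the ordered list and X(N-A) is element `n - a - 1`; the function names and the six sample values are hypothetical.

```python
def modified_mean(values, a):
    """Symmetrically modified mean: drop `a` values from each end, then average."""
    ordered = sorted(values)
    n = len(ordered)
    if a < 0 or 2 * a >= n:
        raise ValueError("a must satisfy 0 <= 2a < n")
    kept = ordered[a:n - a]              # X(A+1) through X(N-A)
    return sum(kept) / len(kept)         # divisor is N - 2A

def modified_midrange(values, a):
    """Symmetrically modified midrange: average of X(A+1) and X(N-A)."""
    ordered = sorted(values)
    n = len(ordered)
    if a < 0 or 2 * a >= n:
        raise ValueError("a must satisfy 0 <= 2a < n")
    return (ordered[a] + ordered[n - a - 1]) / 2

six_values = [1.0, 2.0, 4.0, 7.0, 8.0, 30.0]   # hypothetical, one high extreme

mean_a1 = modified_mean(six_values, 1)          # excludes X1 and X6
mean_a2 = modified_mean(six_values, 2)          # reduces to the median
midrange_a1 = modified_midrange(six_values, 1)
midrange_a2 = modified_midrange(six_values, 2)  # also the median
```

With A = 2 and six values, both functions return the median, in agreement with equations (3-10) and (3-12).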

The  principal  advantage  of  arbitrarily  discarding  data 
from  both  ends  of  an  ordered  set  is  the  simplicity  of  the 
procedure. It has the disadvantage of automatically reducing
the effective size of the sample, discarding good data
along with any "bad" data. For the most scientifically
accurate work, statisticians prefer to discard members of a
sample set on an individual basis.⁵ This may be limited to
eliminating only those values known to have been influenced
by some cause foreign to the rest of the set. It may also be
accomplished by following some statistical rule by which
values can be discarded with a predetermined error risk. The
method of Dixon⁶ for testing extreme values requires only
the available sample observations (its tabulated critical
values assume a normally distributed population).
Dixon's method makes use of critical values of ratios of
differences to be expected at various probability levels and
for different sample sizes. If the observations in the
sample are ranked in order of magnitude as follows:

$$X_1 \le X_2 \le \dots \le X_{n-1} \le X_n$$

the ratio for testing the smallest extreme is:

$$r_{ij} = \frac{X_{1+i} - X_1}{X_{n-j} - X_1} \tag{3-15}$$

and the ratio for testing the largest extreme is:

$$r_{ij} = \frac{X_n - X_{n-i}}{X_n - X_{1+j}} \tag{3-16}$$

The appropriate ratio for various sample sizes is:

sample size 3 to 7   :  $r_{10}$
sample size 8 to 10  :  $r_{11}$
sample size 11 to 13 :  $r_{21}$
sample size 14 to 30 :  $r_{22}$
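
The ratios of equations (3-15) and (3-16), together with the choice of ratio by sample size, can be sketched as follows. The function names are illustrative. No critical values are built in; the observed ratio must still be compared with a published table at the chosen probability level.

```python
def dixon_indices(n):
    """Return (i, j) of the ratio r_ij recommended for sample size n."""
    if 3 <= n <= 7:
        return 1, 0      # r10
    if 8 <= n <= 10:
        return 1, 1      # r11
    if 11 <= n <= 13:
        return 2, 1      # r21
    if 14 <= n <= 30:
        return 2, 2      # r22
    raise ValueError("sample size must be between 3 and 30")

def dixon_ratio(values, extreme="low"):
    """Dixon's ratio for the smallest ("low") or largest ("high") value."""
    x = sorted(values)
    n = len(x)
    i, j = dixon_indices(n)
    if extreme == "low":
        numerator = x[i] - x[0]              # X(1+i) - X(1)
        denominator = x[n - 1 - j] - x[0]    # X(n-j) - X(1)
    elif extreme == "high":
        numerator = x[n - 1] - x[n - 1 - i]  # X(n) - X(n-i)
        denominator = x[n - 1] - x[j]        # X(n) - X(1+j)
    else:
        raise ValueError('extreme must be "low" or "high"')
    if denominator == 0:
        raise ValueError("ratio is undefined when the compared values are equal")
    return numerator / denominator

# Hypothetical sample with a suspect high value
ratio_high = dixon_ratio([1.0, 2.0, 3.0, 4.0, 10.0], extreme="high")
```
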
Tables giving the maximum expected values for Dixon's
ratios are widely reproduced in statistical texts. If an
observed ratio exceeds the maximum expected ratio, the
extreme value may be rejected with the risk of error set by
the tabulated probability level.

Another method based on
statistical probability is the trial and error method. This
method requires an independent estimate of standard deviation.
A trial mean is computed from all the observations in
the sample. Confidence limits at some reasonable level, say
ninety-five per cent, are then set around the trial mean.
Any extreme data point outside the ninety-five per cent confidence
interval is assumed not to have come from the same
population as the rest of the data and is rejected. A new
trial mean and confidence interval are determined based on
the remaining data. The entire original set of observations
is tested against the new confidence limits and additional data
points are rejected and/or previously rejected data points
are picked up. The process is repeated until a stable set
of values is established, that is, no additional data points
are picked up or rejected by the newly computed confidence
interval.
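
The trial and error method lends itself to a short loop. The sketch below rests on two assumptions not stated in the text: the limits for individual observations are taken as the trial mean plus or minus z times the independent standard deviation estimate (z = 1.96 for ninety-five per cent), and the number of rounds is capped in case the procedure oscillates. The function name and data are hypothetical.

```python
from statistics import mean

def trial_and_error_screen(values, sigma, z=1.96, max_rounds=50):
    """Iteratively reject values outside trial_mean +/- z*sigma.

    `sigma` is the independent estimate of standard deviation.
    Returns (kept, rejected).
    """
    kept = list(values)
    for _ in range(max_rounds):
        trial_mean = mean(kept)
        lower = trial_mean - z * sigma
        upper = trial_mean + z * sigma
        # Test the ENTIRE original set, so rejected points can be picked up again.
        new_kept = [x for x in values if lower <= x <= upper]
        if not new_kept:
            raise ValueError("no observations fall inside the limits")
        if new_kept == kept:
            break                        # stable set reached
        kept = new_kept
    rejected = [x for x in values if x not in kept]
    return kept, rejected

readings = [10.1, 9.9, 10.0, 10.2, 9.8, 12.5]   # hypothetical test results
kept, rejected = trial_and_error_screen(readings, sigma=0.2)
```

In this example the high value pulls the first trial mean upward, so several good readings are rejected in the first round and picked up again in the second, which is the behavior the method is designed to allow.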

Estimators  of  Population  Dispersion 

Since the sample mean may not be identical with the
population mean, the sum of squares of deviation of the
individual sample values from the sample mean will be less
than the sum of squares of deviation of the individual sample
values from the population mean. The variance of the sample,
computed from the sum of squares of deviation divided by n,
the number of items in the sample, will therefore be smaller
than if the sum of squares had been calculated from the true
population mean. To overcome this bias, the population
variance is estimated from a sample by dividing the sum of
squares of deviation by n - 1 instead of n.


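$$\text{ESTIMATED POPULATION VARIANCE} = \hat{\sigma}^2 = S^2 \tag{3-17}$$

$$S^2 = \frac{n}{n-1}\,(s^2) = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} \tag{3-18}$$

in which $s^2$ is the variance of the sample computed with the divisor n.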


An unbiased estimate of the population standard deviation
can be obtained by multiplying the square root of the
estimated population variance by a correction factor which
varies with the type of distribution and the sample size.⁷

$$\text{ESTIMATED POPULATION STANDARD DEVIATION} = \hat{\sigma} = k_n \sqrt{S^2} \tag{3-19}$$

For a normally distributed population:

n = 2;  $k_n$ = 1.253
n = 3;  $k_n$ = 1.128
n = 4;  $k_n$ = 1.085
n > 4;  $k_n = 1 + \frac{1}{4(n-1)}$
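
The estimate of equation (3-19) can be sketched as follows, using the correction factors listed above for a normally distributed population. The function names and the sample are hypothetical.

```python
import math
from statistics import variance

def correction_factor(n):
    """Factor k_n applied to the square root of S^2 (normal population)."""
    if n < 2:
        raise ValueError("at least two observations are required")
    tabulated = {2: 1.253, 3: 1.128, 4: 1.085}
    if n in tabulated:
        return tabulated[n]
    return 1.0 + 1.0 / (4.0 * (n - 1))   # approximation given for n > 4

def estimated_sigma(values):
    """Estimated population standard deviation, equation (3-19)."""
    n = len(values)
    s_squared = variance(values)         # divisor n - 1, i.e. S^2 of (3-18)
    return correction_factor(n) * math.sqrt(s_squared)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical sample
sigma_hat = estimated_sigma(sample)
```
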
The sample range, w, multiplied by the appropriate
correction factor forms an unbiased estimator of the population
standard deviation. Tables giving correction factors
to be applied to the range can be found in readily available
textbooks⁹ and handbooks¹⁰ and appear to be based on work
done by Pearson.¹¹

The sample average deviation, A.D., multiplied by a
correction factor forms another unbiased estimator of the
population standard deviation. Still another, and one which
is easier to compute than the average deviation, is the
modified linear estimator. Tables of average deviation
estimators and modified linear estimators were developed by
Dixon and have been published in at least one book which he
has co-authored.¹²

Table I summarizes the range, average deviation and
modified linear estimators of the population standard deviation
for sample sizes two through ten.

TABLE I

UNBIASED ESTIMATORS OF THE POPULATION
STANDARD DEVIATION

Sample     Range        A.D. from        Modified
Size       (R)          Median           Linear

  2        0.8865R      0.8862 DIFA      0.8862 DIFC
  3        0.5907R      0.5908 DIFB      0.5908 DIFC
  4        0.4857R      0.3770 DIFA      0.4857 DIFC
  5        0.4299R      0.3016 DIFB      0.4299 DIFC
  6        0.3946R      0.2369 DIFA      0.2619 DIFD
  7        0.3698R      0.2031 DIFB      0.2370 DIFD
  8        0.3512R      0.1723 DIFA      0.2197 DIFD
  9        0.3367R      0.1532 DIFB      0.2068 DIFD
 10        0.3249R      0.1353 DIFA      0.1968 DIFD

DIFA = (H - L) where L = $\sum X_i$, i = 1 to n/2
               and   H = $\sum X_i$, i = (n/2) + 1 to n

DIFB = (H - L) where L = $\sum X_i$, i = 1 to (n-1)/2
               and   H = $\sum X_i$, i = (n+3)/2 to n

DIFC = (H - L) where L = $X_1$
               and   H = $X_n$

DIFD = (H - L) where L = $X_1 + X_2$
               and   H = $X_n + X_{(n-1)}$
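
The three estimators of Table I can be sketched in code. The coefficients below are transcribed from the table as printed; the function names and sample values are hypothetical, and the samples are assumed to hold between two and ten observations.

```python
# Coefficients transcribed from Table I, keyed by sample size n
RANGE_COEF = {2: 0.8865, 3: 0.5907, 4: 0.4857, 5: 0.4299, 6: 0.3946,
              7: 0.3698, 8: 0.3512, 9: 0.3367, 10: 0.3249}
AD_COEF = {2: 0.8862, 3: 0.5908, 4: 0.3770, 5: 0.3016, 6: 0.2369,
           7: 0.2031, 8: 0.1723, 9: 0.1532, 10: 0.1353}
ML_COEF = {2: 0.8862, 3: 0.5908, 4: 0.4857, 5: 0.4299, 6: 0.2619,
           7: 0.2370, 8: 0.2197, 9: 0.2068, 10: 0.1968}

def sigma_from_range(values):
    x = sorted(values)
    return RANGE_COEF[len(x)] * (x[-1] - x[0])

def sigma_from_average_deviation(values):
    x = sorted(values)
    n = len(x)
    if n % 2 == 0:                       # DIFA: lower half against upper half
        low, high = sum(x[:n // 2]), sum(x[n // 2:])
    else:                                # DIFB: the middle value is left out
        low, high = sum(x[:(n - 1) // 2]), sum(x[(n + 1) // 2:])
    return AD_COEF[n] * (high - low)

def sigma_from_modified_linear(values):
    x = sorted(values)
    n = len(x)
    if n <= 5:                           # DIFC: X(n) - X(1)
        diff = x[-1] - x[0]
    else:                                # DIFD: two largest less two smallest
        diff = (x[-1] + x[-2]) - (x[0] + x[1])
    return ML_COEF[n] * diff

sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical observations
estimates = (sigma_from_range(sample),
             sigma_from_average_deviation(sample),
             sigma_from_modified_linear(sample))
```
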

EFFICIENCY OF ESTIMATORS

The efficiency of an estimator is a relative criterion
based on variance. The variance of an estimator is the mean
squared deviation of the estimates from the true value of
the parameter, and the most efficient estimator of a given
parameter is the one having the smallest variance. Efficiency
is defined as the ratio of the variances of the sampling
distributions of the most efficient estimate and the estimate
being compared.

$$\text{EFFICIENCY} = E = \frac{\text{Variance of the most efficient estimator}}{\text{Variance of the estimator being compared}}$$

Hence, the efficiency of the most efficient estimator is 1;
less efficient estimators have an efficiency of less than 1.
Relative efficiencies are approximately the ratio of
sample sizes which will give equal precision in the estimate.

Efficiency  of  Population  Mean  Estimators 

The sample mean is the efficient estimator of the
population mean. The variance of the sampling distribution
of the mean is $\sigma^2/n$. From the definition of efficiency it
follows¹³ that the variance of the sampling distribution of
an unbiased estimator of the mean of a normal population is
$\sigma^2/(nE)$.

The efficiencies of the median and midrange for
various sample sizes are given in reference 14. The
efficiency of the median is high for small sample sizes,
decreasing to a value of 0.637 as n approaches infinity. For
the midrange, the efficiency is also high for very small
samples but decreases rapidly as the sample size increases,
approaching zero as n approaches infinity.
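
The definition of efficiency can be checked by simulation. The sketch below is a Monte Carlo approximation rather than the analytical derivation behind the published figures: it draws repeated samples from a standard normal population (true mean zero) and compares the mean squared deviation of the sample mean with that of another estimator. The function names, trial count and seed are arbitrary choices.

```python
import random
from statistics import median

def midrange(sample):
    return (min(sample) + max(sample)) / 2

def simulated_efficiency(estimator, n, trials=20000, seed=1):
    """Approximate efficiency of `estimator` relative to the sample mean."""
    rng = random.Random(seed)
    squared_error_mean = 0.0
    squared_error_other = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # The true population mean is 0, so each estimate is its own error.
        squared_error_mean += (sum(sample) / n) ** 2
        squared_error_other += estimator(sample) ** 2
    return squared_error_mean / squared_error_other

efficiency_median_5 = simulated_efficiency(median, 5)
efficiency_midrange_5 = simulated_efficiency(midrange, 5)
```

Because the result is a simulation, it will agree with tabulated efficiencies only to within sampling error, and it changes slightly with the seed and the number of trials.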

By comparison of the sampling distributions of the
means of all possible combinations of two values from a large
sample, it can be shown mathematically that the estimator
with the highest efficiency among the group is the arithmetic
average of the 28.6 percentile value and the 71.4 percentile
value. The 25.0 percentile and the 75.0 percentile
are usually used in practice for large samples because they
are easier to remember and have only a slightly lower
efficiency. The limiting efficiency of this modified midrange
combination is 0.808 as n approaches infinity. For
smaller samples, the efficiency of the Average of the Best
Two increases above 0.808. For sample sizes larger than four,
the efficiency of the Average of the Best Two as an estimator
of the population mean is always greater than that of the
median or unmodified midrange. The estimators and efficiencies
of the Average of the Best Two for various sample
sizes are given in reference 14. Table II gives the estimators
based on the Average of the Best Two for samples of
size two through ten. It also compares the efficiencies of
the median, midrange and Average of the Best Two as estimators
of the population mean for these same sample sizes.
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TABLE  II 

EFFICIENCIES  OF  ESTIMATORS  OF  THE  POPULATION  MEAN 
COMPARED  TO  THE  SAMPLE  MEAN 


Sample 
Size 

Sample 
Median 

Sample 
Midrange 

Aver . 
Eff. 

of  Best  Two 
Estimator 

1.000 

1.000 

1.000 

h(x1 

+ 

x2) 

3 

0.743 

0.920 

0.920 

h(x± 

+ 

x3) 

4 

0.838 

0.838 

0.838 

h(x2 

+ 

x3) 

5 

0.697 

0.767 

0.867 

h(x2 

+ 

V 

6 

0.776 

0.706 

0.865 

h(x2 

-f. 

x5) 

7 

0.679 

0.654 

0.849 

h(x2 

J- 

X6} 

8 

0.743 

0.610 

0.837 

h(x2 

-'- 

X6> 

9 

0.669 

0.572 

0.843 

h(x3 

+ 

x7) 

10 

0.723 

0.539 

0.840 

h(x3 

+ 

Xg) 
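
The three order-statistic estimators compared in Table II can be sketched in a few lines of Python. This is only an illustration: the function names and the ten sample values are hypothetical (one value is deliberately extreme), and the positions 3 and 8 are the ones Table II lists for a sample of size ten.

```python
from statistics import mean, median

def midrange(values):
    """Average of the smallest and largest observations."""
    return (min(values) + max(values)) / 2

def average_of_best_two(values, low, high):
    """Average of the order statistics at 1-based positions low and high."""
    ordered = sorted(values)
    return (ordered[low - 1] + ordered[high - 1]) / 2

# Hypothetical sample of ten results containing one extreme value (12.5).
sample = [10.2, 9.8, 10.0, 10.1, 9.9, 10.4, 9.7, 10.0, 10.3, 12.5]

sample_mean = mean(sample)
sample_median = median(sample)
sample_midrange = midrange(sample)
# Positions 3 and 8 are taken from the size-ten row of Table II.
best_two = average_of_best_two(sample, 3, 8)

print(sample_mean, sample_median, sample_midrange, best_two)
```

In this made-up sample the extreme value pulls the midrange and the mean upward, while the median and the Average of the Best Two stay near the bulk of the data.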

Efficiency  of  Population  Dispersion  Estimators 

The efficiencies of the range, average deviation and modified
linear estimators relative to the square root of S² have been
determined and published. The efficiency of the range
estimator of population standard deviation is relatively high
for sample sizes of five or less, but decreases to 0.85 for a
sample of size ten and to 0.70 for a sample of size twenty. As
the sample size increases indefinitely, it approaches zero.
The efficiency of an estimate based on the average deviation
is greater than that of an estimate based on the range for
sample sizes larger than six. For sample size ten, it is 0.89.
An estimate obtained from the modified linear deviation has an
efficiency equal to or greater than either the estimate
obtained from the range or the estimate obtained from the
average deviation up to sample size five. For larger sample
sizes, its efficiency is consistently greater. The
efficiencies of the range, average deviation and modified
linear estimators for sample sizes two through ten are given
in Table III.

TABLE III

EFFICIENCIES OF ESTIMATORS OF POPULATION
STANDARD DEVIATIONS AS COMPARED TO S

Sample                          Modified Linear
Size       Range      A.D.      Estimate

  2        1.00       1.00      1.00
  3        0.99       0.99      0.99
  4        0.98       0.91      0.98
  5        0.95       0.94      0.96
  6        0.93       0.90      0.96
  7        0.91       0.92      0.97
  8        0.89       0.90      0.97
  9        0.87       0.91      0.97
 10        0.85       0.89      0.96

CHOOSING STATISTICS AND ESTIMATORS

The proper choice of which statistic or which estimator to
use depends upon the problem. Again, the objective is the
closest economically obtainable answer to the true value
being sought.

Quality  surveillance  at  the  command  level  initially 
seeks  to  detect  conditions  which  may  require  corrective 
action.   Answers  which  are  to  be  used  for  management  by 
exception  can  sacrifice  some  statistical  efficiency  for 
computational  efficiency. 

Central  Tendency 

The arithmetic mean is the most widely used measure of
central tendency. Perhaps the most important reason for this
is that means of samples of uniform size tend to have a normal
distribution regardless of the type of distribution of the
population from which the samples were drawn. This
characteristic of the sample means permits the use of the
normal distribution in making probability statements about
the population mean with full confidence even if the
distribution of the population is unknown or uncertain. The
arithmetic mean, being based on all the data, draws the
maximum amount of information from the sample. At the same
time, it is affected by extreme data, a significant
disadvantage when sample size is small and the sample mean is
to be used as an estimate of the population mean. Such is the
case when the central tendency value of a correlation test
sample distributed among a small number of laboratories is to
be used as an estimate of the true value of the property
measured. It is obviously important to exclude extraneous
values from the computation of the sample mean in such
circumstances.

The sample median is a less efficient estimator of the
population mean when both the median and the arithmetic mean
are computed from the same number of observations. For sample
size ten, for example, the efficiency of the median is 0.723.
The median, however, has the advantage that it is not
seriously affected by the retention of extreme values in a
sample.¹⁷ Its efficiency in utilizing available data,
therefore, is one hundred per cent since none of the
observations need be discarded. If, as the result of a test
for outliers, three extraneous values were discarded from a
set of ten to compute the arithmetic mean estimator of the
population mean, the efficiency of utilization of available
data is only seventy per cent. An approximation of the
relative efficiency of the arithmetic mean and the median as
estimators in this case can then be made.

Overall efficiency of arithmetic mean:  0.70 (1.000) = 0.700
Overall efficiency of median:           1.00 (0.723) = 0.723

From this it can readily be seen that the choice of the
arithmetic mean as estimator does not guarantee the most
efficient estimate in every case.
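
The approximation above is simply the fraction of the available data actually used multiplied by the efficiency of the estimator. A minimal Python sketch follows; the function name `overall_efficiency` is hypothetical, and the figures are the ones quoted in the text for a set of ten with three values discarded.

```python
def overall_efficiency(n_used, n_available, estimator_efficiency):
    """Fraction of data retained times the efficiency of the estimator."""
    return (n_used / n_available) * estimator_efficiency

# Arithmetic mean after discarding three of ten values (efficiency 1.000).
mean_overall = overall_efficiency(7, 10, 1.000)
# Median computed from all ten values (efficiency 0.723 for n = 10).
median_overall = overall_efficiency(10, 10, 0.723)

print(round(mean_overall, 3), round(median_overall, 3))
```

Under this approximation the median comes out slightly ahead whenever three or more of the ten observations have to be discarded before the mean is computed.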

In the same vein, it must be remembered that although a more
efficient estimator has a greater statistical chance of being
close to the true population parameter, this does not
guarantee that for each sample a more efficient estimate will
be closer to the parameter than a less efficient estimate.
There is also the question of the relative effort or
difficulty in finding the mean value or the median value. If
the data are arranged in an ordered set, the median can be
located quickly regardless of the sample size. For small
samples, say ten or less, the median value can usually be
determined by inspection relatively quickly even if the data
are not ordered. Mathematically, however, the median is hard
to handle.

The midrange is a good measure of central tendency for five
or fewer observations but not as good as the mean. For sample
sizes larger than five, it is the least efficient estimator of
the population mean. Its chief merit is simplicity of
calculation but, being the average of the largest and smallest
values in a set, it is even more affected by extreme values
than the arithmetic mean and the same tests for extreme values
are required. However, the midrange is superior to the mean or
median for extremely short-tailed distributions. The Average
of the Best Two is a means of artificially creating a
short-tailed distribution by chopping off the most widely
dispersed values. This estimator offers several advantages.
Its construction is such that the probability of being
significantly affected by outliers is relatively small, and
its efficiency relatively high (0.840 for sample size 10).
Yet, it is relatively easy to compute.

Dispersion 

The range is the simplest measure of general variability and
is very easy to compute. If the sample size is small, say ten
or fewer, it is a sensitive measure of the general variability
of the population.¹⁹,²⁰ Since only two of the data points are
involved in the calculation of the range, it in no way
expresses the variation of the other values lying between
these two extremes. Therefore, the accuracy of the range
estimate of dispersion decreases as sample size increases.
Nonetheless, the range is an extremely useful statistic for
small samples and is often used in quality control and
inspection work.

The average deviation is sensitive to the variability of the
population regardless of the size of the sample since it is
based on all the data. On the one hand, it is an obviously
reasonable measure of variability for small samples because it
is simple to interpret and easy to compute. On the other hand,
it is hard to handle in mathematical analysis owing to the use
of absolute values.²¹ There is a tendency to use the average
deviation as a measure of general variability when the median
is used as a measure of central tendency because it is a
minimum when measured from the median. For a normal
distribution, the standard deviation is $\sqrt{\pi/2}$ or
1.253 times the average deviation. If the average deviation is
known from historical data, the standard deviation of a
measurement can be estimated from this relationship.
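
The relationship between the average deviation and the standard deviation for a normal distribution can be sketched as follows. The function names and the five data values are hypothetical, and the average deviation is measured from the arithmetic mean here, which is an assumption of this sketch.

```python
import math
from statistics import mean

# For a normal distribution: sigma = sqrt(pi / 2) * average deviation.
AD_TO_SIGMA = math.sqrt(math.pi / 2)

def average_deviation(values):
    """Mean absolute deviation of the values from their arithmetic mean."""
    center = mean(values)
    return mean([abs(x - center) for x in values])

def sigma_from_average_deviation(avg_dev):
    """Estimate the standard deviation, assuming normally distributed data."""
    return AD_TO_SIGMA * avg_dev

# Hypothetical measurements, used only to exercise the functions.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
avg_dev = average_deviation(data)
sigma_estimate = sigma_from_average_deviation(avg_dev)

print(round(AD_TO_SIGMA, 3), avg_dev, round(sigma_estimate, 3))
```

The conversion factor only holds to the extent that the normality assumption does; for markedly non-normal data the estimate should be treated with caution.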

The variance and the standard deviation are the most
efficient of the estimators of population dispersion. They are
harder to compute than the range or the average deviation but
are much less affected by extreme values than the range and
are mathematically less cumbersome than the average deviation.


CHAPTER  IV 


ANALYSIS  BY  NUMERICAL  METHODS 


INTRODUCTION 

The purpose of this thesis as stated in Chapter I is to
investigate methods of extracting more definitive information
concerning the reliability of the participating laboratories'
test results from correlation test data.

Some statistical methods of treating available correlation
test data sets which will accomplish this purpose are examined
in this chapter. These methods are applied to actual data and
the results are interpreted.

The basis of single observation testing is presented first
and its limitations are pointed out. Next, a method of
analyzing paired sets of data is described and it is shown
that two sets of observations are the minimum required to
estimate the consistency of a laboratory's results using a
proven method. It is also shown that further analysis is
possible but is dependent upon an adequate degree of precision
being exhibited by the two observations.

A method of treating multiple sets of data follows which is
shown to produce a measure of the reliability and a measure of
the systematic error of a laboratory's test results as well as
an improved measure of the relative accuracy of results. Two
laboratory rating methods are described which could be used as
supplements to the Summary of Laboratory Performance described
in Chapter I. One method provides an index of accuracy and an
index of precision for specific tests. The other provides a
laboratory ranking index for the family of tests associated
with a given product.

The manner of presentation of each of the methods for
analyzing the correlation test data is to discuss the theory
and then describe the procedure. The procedural description
includes illustrative computations using actual correlation
test data obtained from a major military command.

Terminology  used  in  connection  with  the  reliability 
of  laboratory  test  results  is  defined  in  Chapter  II. 

The  statistical  measures  applied  are  those  discussed 
in  Chapter  III.   Analysis  of  laboratory  test  results  is  not 
only  a  problem  of  statistical  estimation  but  also  a  problem 
of  hypothesis  testing.   The  statistical  tests  applied  in  this 
chapter  have  not  themselves  been  discussed  previously  in 
this  thesis  except  for  tests  of  extreme  values,  but  they  use 
the  same  statistics  discussed  in  Chapter  III.   The  tests  will 
be described as they are introduced into the problem.

TESTING SINGLE OBSERVATIONS

Discussion 

Accuracy limits. A minimum of two sets of observations are
required to establish an estimate of the precision of a test
method. These can be repeated tests by the same operator using
the same equipment to establish the operator-equipment
precision (repeatability) of the test method, or paired
duplicates from separate laboratories to establish the
interlaboratory precision (reproducibility) of the method.
Once established, the repeatability amount and reproducibility
amount can be used to check the accuracy of a single
observation when the true value of the property being measured
is known or can be estimated.

Let $\bar{d}$ represent the mean difference between pairs of
test measurements.

$$\bar{d} = \frac{\sum_{j=1}^{n} \left| X_{Aj} - X_{Bj} \right|}{n} \tag{4-1}$$

It can be shown that the mean difference between pairs,
$\bar{d}$, is $(2/\sqrt{\pi})$ times the standard deviation. By
transposing terms, an expression is obtained for computing the
standard deviation of a single measurement.

$$\sigma = \frac{\bar{d}\,\sqrt{\pi}}{2} = 0.8862\,\bar{d} \tag{4-2}$$

A confidence interval for the true value of the property being
measured can then be established around the single
observation, $X_{ij}$.

$$\text{Confidence range for } \mu = X \pm z\sigma \tag{4-3}$$

Assuming that the single measurement, $X_{ij}$, comes from a
normally distributed population of similar measurements
affected by a large number of small random factors, z is the
normal deviate appropriate to the desired confidence level.
The term, $\pm z\sigma$, is the tolerance set on the precision
of measurement X. Therefore, if $\bar{d}$ is known or can be
determined, the accuracy of a single measurement can be
estimated corresponding to a predetermined degree of
confidence.

$$\text{Accuracy limits for } X = \mu \pm z\sigma = \mu \pm z(0.8862\,\bar{d}) \tag{4-4}$$

The value of z at the five per cent probability level is 1.96.

$$\text{Accuracy limits for } X_{0.95} = \mu \pm (1.96)(0.8862)\,\bar{d} = \mu \pm 1.74\,\bar{d}$$

These limits can also be expressed as a ninety five per cent
accuracy confidence interval for a single observation,
$X_{ij}$.

$$(\mu - z\sigma) \le X \le (\mu + z\sigma) \tag{4-5}$$

$$(\mu - 1.74\,\bar{d}) \le X_{0.95} \le (\mu + 1.74\,\bar{d})$$

This interval can then be used to test the hypothesis that the
single observation $X_{ij}$ is statistically the same as the
true value, $\mu$, of the property being measured.
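
The chain from paired differences to accuracy limits can be sketched in Python. The sketch assumes that the mean difference is the mean of the absolute differences between pair members; the function names, the four pairs of measurements and the assumed true value are all hypothetical.

```python
import math

# sigma = (sqrt(pi) / 2) * mean absolute paired difference, about 0.8862.
PAIR_FACTOR = math.sqrt(math.pi) / 2

def mean_paired_difference(pairs):
    """Mean of the absolute differences between the members of each pair."""
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def accuracy_limits(true_value, d_bar, z=1.96):
    """Limits true_value +/- z * sigma, with sigma estimated from d_bar."""
    half_width = z * PAIR_FACTOR * d_bar
    return true_value - half_width, true_value + half_width

# Hypothetical duplicate measurements (A, B) and a hypothetical true value.
pairs = [(69.2, 69.4), (69.1, 69.0), (69.3, 69.6), (69.2, 69.2)]
d_bar = mean_paired_difference(pairs)
sigma = PAIR_FACTOR * d_bar
low, high = accuracy_limits(69.2, d_bar)

print(round(d_bar, 3), round(sigma, 4), round(low, 3), round(high, 3))
```

With z = 1.96 the half-width works out to roughly 1.74 times the mean paired difference, which is the multiplier used in the text.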

Procedure 

Data and assumptions. The raw data required are the test
results for a given property obtained from a single sample
which has been divided and distributed among the participating
laboratories. Analysis of the data is based upon the following
assumptions: (A) The sub-divided samples are homogeneous, that
is, there is no quality variation of the material distributed
to the various participating laboratories, (B) The universe of
observations for each laboratory and all laboratories is
normally distributed, (C) The test procedure has been proven,
that is, it is adequately described to preclude general
misinterpretation of the exact procedures to be followed.

For example, the following single measurements were submitted
as the API Gravity of aviation gasoline sample 63-1700 by the
ten participating laboratories in a correlation test.


TABLE IV

MEASUREMENTS OF API GRAVITY OF AVIATION GASOLINE
SAMPLE 63-1700 BY TEN LABORATORIES

                               Laboratory
Test          1     2     3     4     5     6     7     8     9    10

API Grav.   69.8  69.1  69.6  69.1  69.1  69.2  69.2  69.2  69.4  69.2


Decision rule: accuracy. Compute the estimated true API
gravity of the gasoline using the sample arithmetic mean as
the estimator. Substituting in (3-3):

$$\hat{\mu} = \frac{\sum_{j=1}^{10} X_j}{10} = \frac{692.9}{10} = 69.3$$

The ASTM reproducibility amount, R.A., described in Chapter
II, can be substituted for the ninety five per cent confidence
interval range, $\pm z\sigma$, in (4-4) as a standard to test
the statistical accuracy of the single test result obtained by
each laboratory. (4-5) then becomes:

$$\left(\hat{\mu} - \frac{\mathrm{R.A.}}{2}\right) \le X_{0.95} \le \left(\hat{\mu} + \frac{\mathrm{R.A.}}{2}\right) \tag{4-6}$$

and the decision rule is:

If the observed value is between the estimated population
mean minus one half of the ASTM Reproducibility amount and the
estimated population mean plus one half the ASTM
Reproducibility amount, conclude that results obtained by the
laboratory for this test are statistically accurate. If the
observed value lies outside these limits, conclude that
results obtained by the laboratory for this test have errors
attributable to assignable causes with a five per cent risk of
being wrong.

Determine the ASTM Reproducibility amount, R.A., from the
Standard Method of Test for API Gravity of Petroleum Products,
ASTM Designation: D287-55.²⁴

$$\mathrm{R.A.} = 0.5 \text{ degrees API}$$

Compute the ninety five per cent confidence limits:

$$\hat{\mu} \pm \frac{\mathrm{R.A.}}{2} = 69.3 \pm 0.25$$

At the ninety five per cent confidence level, test the
hypothesis that the API Gravity measurement $X_j$, reported by
laboratory j, is statistically the same as the true API
Gravity of the sample. Substituting in (4-6):

$$69.05 \le X_j \le 69.55$$

If the $X_j$ is between 69.05 and 69.55, accept the hypothesis
and conclude that results obtained for this test by laboratory
j are statistically accurate. If the $X_j$ is less than 69.05
or more than 69.55, reject the hypothesis and conclude that
results obtained for this test by laboratory j have errors
attributable to assignable causes.
The hypothesis is rejected for two values:

$$X_1 = 69.8 > 69.55$$

$$X_3 = 69.6 > 69.55$$

The distorting effect of outlying data on estimates of
population parameters was discussed in Chapter III and a trial
and error method of eliminating outliers from computation of
the mean was described. Applying this method:

$$\hat{\mu} = \frac{\sum_{j=1}^{10} X_j - X_1 - X_3}{8} = \frac{553.5}{8} = 69.2$$

Compute new ninety five per cent confidence limits:

$$\hat{\mu} \pm \frac{\mathrm{R.A.}}{2} = 69.2 \pm 0.25$$

Substitute in equation (4-6) and retest the hypothesis for all
ten measurements $X_j$:

$$68.95 \le X_j \le 69.45$$

The hypothesis is rejected for the same two values:

$$X_1 = 69.8 > 69.45$$

$$X_3 = 69.6 > 69.45$$

Since no additional data points were rejected and none
previously rejected were picked up, a stable set of values has
been determined.
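
The trial and error procedure just illustrated can be sketched in Python using the Table IV measurements and the reproducibility amount of 0.5 degrees API. The function name is hypothetical, and rounding the estimated mean to one decimal place is an assumption made only to mirror the worked example.

```python
def screen_single_observations(results, reproducibility, max_rounds=10):
    """Repeat the single-observation test until the rejected set is stable.

    results maps laboratory number to its reported value. Returns the final
    estimated mean and the sorted list of rejected laboratories.
    """
    half_width = reproducibility / 2
    rejected = set()
    center = None
    for _ in range(max_rounds):
        kept = [x for lab, x in results.items() if lab not in rejected]
        if not kept:
            raise ValueError("every observation was rejected")
        # Assumption: round the mean to one decimal, as the example does.
        center = round(sum(kept) / len(kept), 1)
        low, high = center - half_width, center + half_width
        newly_rejected = {lab for lab, x in results.items()
                          if not low <= x <= high}
        if newly_rejected == rejected:
            break  # no values dropped or picked up: a stable set
        rejected = newly_rejected
    return center, sorted(rejected)

# API gravity measurements from Table IV, keyed by laboratory number.
api_gravity = {1: 69.8, 2: 69.1, 3: 69.6, 4: 69.1, 5: 69.1,
               6: 69.2, 7: 69.2, 8: 69.2, 9: 69.4, 10: 69.2}

estimated_mean, rejected_labs = screen_single_observations(api_gravity, 0.5)
print(estimated_mean, rejected_labs)
```

All ten measurements are retested on every pass, so a value rejected earlier can be picked up again if the recomputed mean moves toward it.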

This is the method presently used to evaluate correlation
test results. It has been previously pointed out that this
method gives no indication of whether systematic errors or
mistakes are the causes of out-of-control observations.

TESTING PAIRED OBSERVATIONS


Discussion 

Just  as  a  minimum  of  two  sets  of  observations  were 
required  to  establish  an  estimate  of  the  precision  of  a  test 
method,  two  observations  are  the  minimum  data  required  to 
estimate  the  consistency  of  a  laboratory's  results  using  a 
proven  method. 


Precision  Limits 

Two observations can be analyzed for precision by estimating
the standard deviation from the mean difference between pairs.
Precision limits for $\mu$:

$$\mu = X \pm z\sigma \tag{4-3}$$

Confidence interval for X:

$$(\mu - z\sigma) \le X \le (\mu + z\sigma) \tag{4-5}$$

Let the confidence range, $\pm z\sigma$, which is constant for
a given probability level, be represented by the symbol
$\pm C$. The paired test results from one laboratory are
represented by $X_{Aj}$ and $X_{Bj}$. The sample mean,
$\bar{X}_j$, is the estimator of the population mean. Then:

$$(\bar{X}_j - C) \le X_{Aj} \le (\bar{X}_j + C) \tag{4-7}$$

Substituting for $\bar{X}_j$:

$$\frac{X_{Aj} + X_{Bj}}{2} - C \le X_{Aj} \le \frac{X_{Aj} + X_{Bj}}{2} + C \tag{4-8}$$

Clearing fractions:

$$(X_{Aj} + X_{Bj} - 2C) \le (2X_{Aj}) \le (X_{Aj} + X_{Bj} + 2C)$$

Subtracting $X_{Aj}$:

$$(X_{Bj} - 2C) \le (X_{Aj}) \le (X_{Bj} + 2C)$$

Subtracting $X_{Bj}$:

$$(-2C) \le (X_{Aj} - X_{Bj}) \le (+2C)$$

Transposing:

$$(X_{Aj} - X_{Bj}) \le \pm 2C \tag{4-9}$$

Therefore:

$$\left| X_{Aj} - X_{Bj} \right| \le 2C = 2z\sigma \tag{4-10}$$

Likewise:

$$(\bar{X}_j - C) \le X_{Bj} \le (\bar{X}_j + C)$$

Substituting for $\bar{X}_j$, clearing fractions, subtracting
$X_{Aj}$ and $X_{Bj}$, and transposing terms:

$$(X_{Bj} - X_{Aj}) \le \pm 2C$$

And:

$$\left| X_{Bj} - X_{Aj} \right| \le 2C$$

But:

$$\left| X_{Bj} - X_{Aj} \right| = \left| X_{Aj} - X_{Bj} \right|$$

Therefore:

$$\left| X_{Aj} - X_{Bj} \right| \le 2C = 2z\sigma \tag{4-11}$$

At the ninety five per cent confidence level:

$$\left| X_{Aj} - X_{Bj} \right| \le 2(1.74)\,\bar{d} = 3.48\,\bar{d}$$

This range limit can then be used to test the hypothesis that
a pair of observations ($X_{Aj}$, $X_{Bj}$) are statistically
one and the same value. If they are, further statistical
inferences may be drawn from them.

Estimating systematic error. If the two observations from a
laboratory show an acceptable degree of precision, an estimate
can be made of the amount and direction of systematic error or
bias which they contain.

$$\text{BIAS} = \frac{(X_{Aj} - \bar{X}_A) + (X_{Bj} - \bar{X}_B)}{2} \tag{4-12}$$

or, for simpler calculation,

$$\text{BIAS} = \frac{X_{Aj} + X_{Bj}}{2} - \frac{\bar{X}_A + \bar{X}_B}{2} = \bar{X}_j - \bar{X} \tag{4-13}$$
Although constant factors may be present in measurements
which are not statistically precise, there is a high
probability that either or both of the measurements also
contain errors caused by mistakes of unknown magnitude and
direction. A 'bias' computation would be meaningless in such
circumstances, could only cause confusion and should not be
made.

Accuracy limits. The standard deviation of the means of
samples of size n is estimated by dividing the estimated
population standard deviation by the square root of n. One
possible way to define accuracy in a normally distributed
population is:

$$\bar{X} \le \mu \pm z\left(\frac{\sigma}{\sqrt{n}}\right) \tag{4-14}$$

where:

$\mu$ = the true value of the property being measured

$\sigma$ = the population standard deviation

n = the sample size

z = the normal deviate for the desired level of confidence

$\bar{X} = \sum X / n$

But, once again, the proper choice of a statistic or estimator
is dependent upon the available data and the intended purpose
for which it is to be used. For small samples, acceptance of
the hypothesis that the sample mean and the true value of the
property being measured are statistically one and the same
value on the basis of the above test may occur even when the
hypothesis is not true. For example, assume two observations
are obtained of a property whose true value is zero. It can be
readily seen that, regardless of magnitude, if the two
observations have the same value but opposite signs the
average will be zero. Although the mean of the observations
would be statistically "accurate", the hypothesis test is
obviously meaningless for such a situation.

One simple way to handle this dilemma is to tie the accuracy
determination to the precision test. The hypothesis test for
accuracy of a laboratory's test results would then be modified
to the extent that $\bar{X}$ would be redefined as the average
of a set of laboratory test results which are statistically
precise at some specified level.

Procedure 

Data. The raw data required are the test results for a given
property obtained from two samples which have been divided and
distributed among the m participating laboratories. It is not
necessary that both samples be of the same product. It may be
feasible to pool test results of different products. Volk
states that, in comparing paired data, the pairs do not have
to be measures of the same thing, but the individual
measurements in a pair will be made at the same conditions.²⁵
The objective is to avoid introducing additional sources of
variability. Generally, this objective can be accomplished if
the test procedures are identical and if the samples are
reasonably close in the magnitude of the property being
evaluated.²⁵ However, even though pooled test results are
obtained from statistically homogeneous samples, if they are
not duplicate tests of the same sample, they do not have a
common mean. Consequently, the observations $X_{1j}$ and
$X_{2j}$ cannot be compared directly. The algebraic deviation
from the mean, $v_{ij}$, for each observation must be
determined by subtracting the mean, $\bar{X}_i$, computed for
each test from each observation reported for that test.

$$v_{ij} = X_{ij} - \bar{X}_i \tag{4-15}$$

More will be said about the pooling of data to form larger
samples in the section on multiple test results.

Correlation test results of the ten per cent distillation
point of two different samples of aviation gasoline, grade
115/145, provided the data which will be used to illustrate
the procedure. These two sets of values are given in Table V.
The results labeled as Test 1 are measurements taken on
correlation test sample 64-27. Those labeled as Test 2 are
measurements taken on correlation test sample 64-3599. The
corresponding matrix of observations, $v_{ij}$, is given in
Table VI.

Assumptions. Analysis of the data is based upon the following
assumptions: (A) The sub-divided samples are homogeneous, that
is, there is no quality variation of the material distributed
to the various participating laboratories for each test, (B)
The universe of observations for each activity and all
activities is normally distributed; and, (C) The test
procedure has been proven, that is, it is adequately described
to preclude general misinterpretation of the exact procedures
to be followed.

Decision rule: precision. The ASTM reproducibility amount,
R.A., described in Chapter II, can be substituted for the
ninety five per cent confidence interval range 2C in (4-11) as
a standard to test the statistical precision of the pair of
test results obtained by each laboratory. (4-11) then becomes:

$$\left| v_{1j} - v_{2j} \right| \le \mathrm{R.A.} \tag{4-16}$$

and the decision rule is:

If the absolute value of the difference between the
deviations from the test means of two independent measurements
is equal to or less than the ASTM reproducibility amount for
the test, conclude that results obtained by the laboratory for
this test are sufficiently precise, i.e., errors affecting
results are probably due to chance causes inherent to the
prescribed test method. If the absolute difference is greater
than the ASTM reproducibility amount, conclude that results
obtained from performance of this test by the laboratory have
errors attributable to assignable causes with a five per cent
risk of being wrong.

Determine the ASTM reproducibility amount, R.A., from the
Standard Method of Test for Distillation of Petroleum
Products, ASTM Designation: D86-61.²⁷

$$\mathrm{R.A.} = 7\ ^{\circ}\mathrm{F}$$

For the m laboratories, compute:

$$\left| v_{1j} - v_{2j} \right|, \quad j = 1 \text{ to } m$$

At the ninety five per cent confidence level, test the
hypothesis that the ten per cent distillation point
measurements $X_{1j}$ and $X_{2j}$ reported by laboratory j are
statistically the same in respect to their deviation from the
true values of the ten per cent distillation points of samples
1 and 2 respectively. Substituting in (4-16):

$$\left| v_{1j} - v_{2j} \right| \le 7$$

If the absolute difference between $v_{1j}$ and $v_{2j}$ is
equal to or less than 7, accept the hypothesis and conclude
that results obtained for this test by laboratory j are
sufficiently precise. If the difference is greater than 7,
reject the hypothesis and conclude that results obtained for
this test by laboratory j fail to meet minimum standards for
precision.

The differences, $\left| v_{1j} - v_{2j} \right|$, for the
illustrative test results are tabulated in Table VI. The
paired test results from all laboratories are precise
according to the established standard. Consequently, all may
be further analyzed for average bias and for accuracy.
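
The precision test can be sketched in Python as follows. Because Tables V and VI are not reproduced here, the distillation temperatures below are hypothetical values for four laboratories, chosen so that one laboratory fails; the function names are likewise hypothetical. Only the reproducibility amount of 7 °F comes from the text.

```python
from statistics import mean

def deviations_from_test_mean(test_results):
    """Subtract the mean of one test from each laboratory's result."""
    test_mean = mean(test_results.values())
    return {lab: x - test_mean for lab, x in test_results.items()}

def precision_check(test1, test2, reproducibility):
    """Map each laboratory to True when |v1j - v2j| <= reproducibility."""
    v1 = deviations_from_test_mean(test1)
    v2 = deviations_from_test_mean(test2)
    return {lab: abs(v1[lab] - v2[lab]) <= reproducibility for lab in test1}

# Hypothetical ten per cent distillation points (deg F) for four labs.
test_1 = {1: 150.0, 2: 152.0, 3: 148.0, 4: 160.0}
test_2 = {1: 155.0, 2: 156.0, 3: 153.0, 4: 154.0}

is_precise = precision_check(test_1, test_2, reproducibility=7.0)
print(is_precise)
```

Working with deviations from each test mean rather than the raw readings is what allows two samples with different true values to be compared at all.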

Bias measurement. The mean deviation from the mean of the
paired test results reported by laboratory j is determined by:

$$\bar{v}_j = \frac{v_{1j} + v_{2j}}{2} \tag{4-17}$$

This is equivalent to (4-12) for the bias estimate based on
two observations from a laboratory which shows an acceptable
degree of precision. The values $\bar{v}_j$ computed from the
illustrative data appear in Table VI. These values will be
further utilized in testing the accuracy of the laboratories.

Decision  rule :   accuracy.   A  test  for  accuracy  is 
given  by  (4-14)  in  which  X  is  defined  as  the  average  of  a 
set  of  laboratory  test  results  which  are  statistically  pre- 
cise at  some  specified  level.   Substituting  v.  for  X  and 
v.  .  for  u.  (4-14)  becomes: 

v.  ;  v.  .  ±  -££=  (4-18) 

3   -      ij    y  n 

But  v.  .  is  zero  by  definition.   Therefore  (4-18)  becomes; 
ij  ■* 

v  .  <  0  -  -—5-  (4-19) 

J  -         n 


D'J 

The ASTM reproducibility amount, R.A., can be substituted
for the ninety five per cent confidence interval
range $\pm z\sigma$ in (4-19) as a standard to test the statistical
accuracy of the paired test results obtained by each laboratory
(so that $z\sigma$ is replaced by R.A./2). Also substituting
n = 2, (4-19) becomes:

$$-\frac{R.A.}{2\sqrt{2}} \le \bar{v}_j \le +\frac{R.A.}{2\sqrt{2}} \tag{4-20}$$

and  the  decision  rule  is: 

If  two  single  observations  obtained  from  statis- 
tically homogeneous  sources  are  statistically  precise 
at  the  ninety  five  per  cent  level,  and  if  the  absolute 
value of the average variation from the mean of the
paired  single  observations  is  within  the  ninety  five 
per  cent  confidence  range  based  on  the  applicable 
ASTM  Reproducibility  amount,  conclude  that  results 
obtained  by  the  laboratory  for  this  test  are  accurate. 
If  the  absolute  value  of  the  average  is  above  or  below 
the  ninety  five  per  cent  confidence  range,  conclude 
that  the  results  obtained  performing  this  test  con- 
tain errors  which  cannot  be  accounted  for  by  chance 
causes  with  a  five  per  cent  risk  of  having  reached 
the  wrong  conclusion. 

Compute the ninety five per cent confidence limits:

$$0 \pm \frac{R.A.}{2\sqrt{2}} = 0 \pm \frac{7}{2\sqrt{2}} = \pm 2.5$$

At the ninety five per cent confidence level, test
the hypothesis that in regard to deviation from the true value
of the property measured, the average of a pair of measurements
is statistically the same as zero. Substituting in
(4-20):

$$-2.5 \le \bar{v}_j \le +2.5$$

If $\bar{v}_j$ is between -2.5 and +2.5, accept the hypothesis
and conclude that, on the average, results obtained for
this test by laboratory j are sufficiently accurate. If
$\bar{v}_j$ is less than -2.5 or greater than +2.5, reject the
hypothesis and conclude that, on the average, the results obtained
for this test by laboratory j have errors attributable to
assignable causes.

The  hypothesis  is  accepted  for  the  ten  laboratories 
in  the  example  but  laboratory  1  is  on  the  borderline. 
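
The two decision rules for paired data can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions: the function name `evaluate_pair` and the sample
deviations are hypothetical, the precision limit of 7 and the
reproducibility amount of 7 are the values used in the text, and
the confidence limit is left unrounded (2.47 rather than 2.5).

```python
import math


def evaluate_pair(v1, v2, precision_limit, reproducibility):
    """Apply the paired-data precision and accuracy rules to one laboratory.

    v1, v2 are the two deviations from the mean reported by the laboratory.
    """
    difference = abs(v1 - v2)                 # |v_1j - v_2j|
    precise = difference <= precision_limit   # precision decision rule
    v_bar = (v1 + v2) / 2.0                   # bias estimate, equation (4-17)
    limit = reproducibility / (2.0 * math.sqrt(2.0))  # limit in (4-20)
    # Accuracy is only judged for pairs that first pass the precision rule.
    accurate = precise and abs(v_bar) <= limit
    return {
        "difference": difference,
        "precise": precise,
        "v_bar": v_bar,
        "limit": limit,
        "accurate": accurate,
    }


# Hypothetical pair of deviations, not taken from Table VI.
result = evaluate_pair(3.0, -1.0, precision_limit=7.0, reproducibility=7.0)
print(result)
```

A laboratory whose $\bar{v}_j$ sits near the limit, as laboratory 1
does in the example, may change classification depending on whether
the limit is rounded, which is one reason to keep the unrounded value
in any automated check.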

While  these  results  produce  a  quick  and  satisfactory 
indication  of  accuracy,  they  do  not  make  full  use  of  the 
available  information.   They  do  not  take  into  consideration 
the  probability  of  statistically  independent  events.   The 
outcome  of  either  of  two  separate  laboratory  tests  is  not 
conditioned  by  the  outcome  of  the  other.   Therefore  observa- 
tion A  and  observation  B  are  statistically  independent  and 
the  probability  of  both  A  and  B  occurring  is  the  product  of 
the  probability  of  A  occurring  and  the  probability  of  B 
occurring.

$$\Pr(A \text{ and } B) = \Pr(A)\,\Pr(B) \tag{4-21}$$

The hypothesis test employed assumes that both observations
(either the $X_{Aj}$ and $X_{Bj}$ replicate measurements or the
$v_{Aj}$ and $v_{Bj}$ single measurements) come from the same normally
distributed population. Therefore the distance from the
population mean of each observation can be expressed in
terms of multiples of the population standard deviation,
that is, the normal deviate, z. The area under the frequency
distribution curve, bounded by the interval dz which includes
$z_A$, measures the probability of obtaining observation A in
a random sample as shown in Figure 4-1. Likewise, the area
under the frequency distribution curve bounded by the interval
dz which includes $z_B$ measures the probability of obtaining
observation B in a random sample. In a normal distribution
the probability of obtaining a particular value of z diminishes
as the magnitude of z increases. Therefore, the probability of
obtaining two observations out in one or the other tail of
the distribution due to chance causes alone is very small.
Conversely, the probability of obtaining two observations
close to the population mean if only chance causes are
affecting the measurements is relatively high.

[Figure: normal frequency distribution curve with two shaded
strips labeled Area = Pr(A) and Area = Pr(B)]

FIGURE 4-1

THE PROBABILITY OF OBTAINING A GIVEN VALUE
FROM A NORMAL DISTRIBUTION

Given two sets of results, ($z_{A1}$ = 1.95, $z_{B1}$ = 0.0)
and ($z_{A2}$ = 1.95, $z_{B2}$ = 1.95), one would conclude intuitively
that results from laboratory 1 are more apt to be accurate
than results from laboratory 2. Indeed it can be shown
that if a finite z-interval of 0.02 is substituted for dz,
the probability of obtaining the subset of measurements
($z_{A1}$, $z_{B1}$) due to random variation is more than six and a
half times as great as the probability of obtaining the subset
($z_{A2}$, $z_{B2}$).

A1 = Pr (1.95 < z < 1.97) = .0012
B1 = Pr (-.01 < z < +.01) = .0080
Pr (A1 and B1) = (.0012)(.0080) = 9.60(10^-6)
A2 = Pr (1.95 < z < 1.97) = .0012
B2 = Pr (1.95 < z < 1.97) = .0012
Pr (A2 and B2) = (.0012)(.0012) = 1.44(10^-6)
The consequences of applying this rule do not appear
to be significant enough to justify the considerable extra
effort required. However the overall effect should be noted.
Viewed from the standpoint of confidence level, the probability
of an observation A greater than $z_{0.95}$ and an
observation B greater than $z_{0.95}$ is (0.05)(0.05) or 0.0025.
Therefore the decision rule carries a risk which varies
from 0.05 to 0.0025 of wrongly classifying an "accurate"
activity as "inaccurate." Conversely, the risk of failing
to detect an "inaccurate" activity is increased.
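
The interval probabilities quoted above can be checked with the
standard normal distribution function. The sketch below uses only
the Python standard library; the helper names `phi` and
`interval_probability` are hypothetical, and the 0.02 interval width
and the z values are those of the text. Because the code does not
round the interval probabilities to four places, its ratio differs
slightly from the hand computation.

```python
import math


def phi(z):
    """Cumulative distribution function of the standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_probability(lower, upper):
    """Probability that a standard normal deviate falls in (lower, upper)."""
    return phi(upper) - phi(lower)


# Laboratory 1: one observation in the tail, one at the mean.
p_a1 = interval_probability(1.95, 1.97)
p_b1 = interval_probability(-0.01, 0.01)
# Laboratory 2: both observations in the same tail.
p_a2 = interval_probability(1.95, 1.97)
p_b2 = interval_probability(1.95, 1.97)

# Independence, equation (4-21): the joint probability is the product.
joint_1 = p_a1 * p_b1
joint_2 = p_a2 * p_b2
ratio = joint_1 / joint_2

print(f"Pr(A1 and B1) = {joint_1:.3e}")
print(f"Pr(A2 and B2) = {joint_2:.3e}")
print(f"ratio = {ratio:.2f}")
```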

TESTING  MULTIPLE  OBSERVATIONS 

Discussion 

Consider  the  results  of  n  tests  submitted  by  m  lab- 
oratories as  represented  by  the  matrix  of  Table  VII.   Assume 
that  the  universe  of  observations  for  each  test  is  normally 
distributed.   The  objective  is  to  determine  the  kind  and 
magnitude  of  variability  that  can  be  expected  to  be  included 
in  observations  made  by  a  given  laboratory.   Since  the 
measurement  quality  of  interest  is  variability,  the  first 
step  is  to  convert  the  data  to  measurements  of  variation  or 
algebraic  distance  from  the  true  value  of  the  property  being 
measured. 

TABLE VII

SYMBOLIC MATRIX OF RESULTS OF n TESTS
SUBMITTED BY m LABORATORIES

                           Lab. j
Test i     1       2      ...      j      ...      m
  1      X_11    X_12     ...    X_1j     ...    X_1m
  2      X_21    X_22     ...    X_2j     ...    X_2m
  .
  i      X_i1    X_i2     ...    X_ij     ...    X_im
  .
  n      X_n1    X_n2     ...    X_nj     ...    X_nm

For each test, a sample mean, $\bar{X}_i$, can be obtained
which can be used as an estimator of the population mean. If
the n tests were duplicate tests of homogeneous samples
taken from the same population, the test means would be
expected to cluster around a single value, the population
mean, $\mu$. The average mean, $\bar{\bar{X}}$, becomes a better
estimator of the population mean which can be used to determine
the algebraic variation from the mean, $v_{ij}$, of each of the
n times m observations. If the n tests were not duplicate tests of
the same batch of product, but (A) the tests were identical
in procedure, and (B) the materials tested are close enough
in magnitude of the property measured as to preclude any
significant variation in the random error due to material,
the test results can be compared in regard to variation from
the mean but do not have a common mean. The $v_{ij}$ for each
observation can be determined only by subtracting the $\bar{X}_i$
computed for each test from each observation reported for that
test. A new matrix, Table VIII, results.

TABLE VIII

SYMBOLIC MATRIX OF DEVIATION, $v_{ij}$, FROM
ESTIMATED TEST POPULATION MEAN

                           Lab. j
Test i     1       2      ...      j      ...      m
  1      v_11    v_12     ...    v_1j     ...    v_1m
  2      v_21    v_22     ...    v_2j     ...    v_2m
  .
  i      v_i1    v_i2     ...    v_ij     ...    v_im
  .
  n      v_n1    v_n2     ...    v_nj     ...    v_nm


Homogeneity  of  Variance 

By  pooling  data  sets  in  this  manner,  larger  samples 
are  available  for  estimating  the  variability  of  laboratory 
observations  resulting  in  potentially  better  estimates.   Only 
data  sets  having  statistically  homogeneous  variances  are 
really comparable, however. A statistical test was devised
by Bartlett for passing judgement in such cases. If n sets
of data are available with varying numbers of observations,
$m_i$, in each set, the statistical parameter, B, can be computed
in the following manner:

TABLE IX

BARTLETT'S TEST FOR HOMOGENEITY OF VARIANCES

Test               Degrees of
Data     S_i^2     Freedom          f_i S_i^2     ln S_i^2     f_i ln S_i^2     1/f_i
Set                f_i = (m_i - 1)

 1       S_1^2     f_1              f_1 S_1^2     ln S_1^2     f_1 ln S_1^2     1/f_1
 2       S_2^2     f_2              f_2 S_2^2     ln S_2^2     f_2 ln S_2^2     1/f_2
 .
 i       S_i^2     f_i              f_i S_i^2     ln S_i^2     f_i ln S_i^2     1/f_i
 .
 n       S_n^2     f_n              f_n S_n^2     ln S_n^2     f_n ln S_n^2     1/f_n

TOTALS             f                Σ f_i S_i^2                Σ f_i ln S_i^2   Σ 1/f_i


Compute

$$S^2 = \frac{\sum f_i S_i^2}{f} \tag{4-22}$$

and:

$$f \ln S^2 \tag{4-23}$$

then:

$$B = \frac{1}{C}\left(f \ln S^2 - \sum f_i \ln S_i^2\right) \tag{4-24}$$


The value of B may be computed initially without evaluating
the correction factor, C. The critical value of B at the
selected confidence level may be read from a statistical
table of chi square available in most statistics texts and
handbooks, entering the table with (n-1) degrees of freedom.
If B is significant at the selected confidence level, i.e.,
exceeds the critical value, it may then be divided by the
correction factor, C, computed as follows:

$$C = 1 + \frac{\sum \frac{1}{f_i} - \frac{1}{f}}{3(n-1)} \tag{4-25}$$

If  the  corrected  value  of  B  is  also  significant  at  the 
selected  confidence  level,  reject  the  hypothesis  that  the 
sets  of  data  being  compared  have  the  same  variance. 
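
The computation laid out in Table IX and equations (4-22) through
(4-25) can be sketched as follows. This is a minimal illustration,
not the author's worksheet: the function name `bartlett_b` and the
five variances are hypothetical, while the degrees of freedom
(four sets of ten observations and one of nine) and the chi-square
critical value of 9.488 for four degrees of freedom match the
example worked later in the chapter.

```python
import math


def bartlett_b(variances, dofs):
    """Return (uncorrected B, correction factor C, corrected B).

    variances: the S_i^2 of each data set
    dofs:      the matching degrees of freedom f_i = m_i - 1
    """
    n = len(variances)
    f = sum(dofs)                                                # total f
    pooled = sum(fi * s2 for fi, s2 in zip(dofs, variances)) / f  # (4-22)
    f_ln_s2 = f * math.log(pooled)                               # (4-23)
    sum_f_ln = sum(fi * math.log(s2) for fi, s2 in zip(dofs, variances))
    b_uncorrected = f_ln_s2 - sum_f_ln                           # (4-24) with C = 1
    c = 1.0 + (sum(1.0 / fi for fi in dofs) - 1.0 / f) / (3.0 * (n - 1))  # (4-25)
    return b_uncorrected, c, b_uncorrected / c


# Hypothetical variances; the first set is deliberately much larger.
variances = [0.117, 0.012, 0.020, 0.010, 0.015]
dofs = [9, 9, 9, 9, 8]
b_raw, c, b_corrected = bartlett_b(variances, dofs)

CHI_SQUARE_95_DF4 = 9.488  # critical value for (n - 1) = 4 degrees of freedom
print(f"B = {b_raw:.2f}, C = {c:.4f}, corrected B = {b_corrected:.2f}")
print("reject homogeneity" if b_corrected > CHI_SQUARE_95_DF4 else "accept homogeneity")
```

Since C is always greater than one, the corrected B is never larger
than the uncorrected value, which is why the text evaluates C only
when the uncorrected B is already significant.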

Analyzing  the  Data 

For each of the m participating laboratories, an
average algebraic variation from the mean, $\bar{v}_j$, can be
computed. This is the average accuracy error and constitutes
a point estimate of the magnitude and direction of the
systematic error or bias.

An estimated population variance, $\hat{\sigma}_j^2$, also can be
computed for each activity, using $S_j^2$ as the estimator.
This is a measure of the variation in the point estimate
of the systematic error due to random and accidental causes.

Having sufficiently isolated random, systematic, and
accidental errors to obtain an approximate measure of each,
a judgement can be made concerning laboratory reliability
by comparing the measures of reliability for each activity
against matching standards.

Procedure 

Data  and  assumption.   The  raw  data  required  are  the 
test  results  for  a  given  property  obtained  from  n  samples 
of  different  batches  of  product  which  have  each  been  divided 
and  distributed  among  the  m  participating  laboratories.   It 
is  not  necessary  that  all  samples  be  of  the  same  product. 
It  may  be  feasible  to  pool  test  results  of  different  pro- 
ducts.  The  considerations  in  this  regard  are  the  same  as 
for  paired  data.   When  doubt  exists,  a  statistical  test  for 
homogeneity of variance of the pooled data is appropriate.

Analysis  of  the  data  is  based  upon  the  same  assump- 
tions already  stated  for  paired  data. 

To  illustrate  the  procedure,  the  correlation  test 
results  used  are  the  measurements  of  API  Gravity  for  five 
different  products.   The  matrix  of  these  observations  is 
given  in  Table  X. 

Estimating the population mean. Since the several
sets of test results are not repeat measurements of the
same product sample, the tests do not have a common mean.
A separate estimate of the population mean, $\mu_i$, must be
made for each of the n tests.

The most efficient estimator of the population mean
is the sample arithmetic mean. Because outliers can have a
significant effect on the arithmetic mean of small samples,
an appropriate test should be applied to any values which
appear extreme. Dixon's test for extreme values, described
in Chapter II, will be used to check the two doubtful values
in Table X:

$X_{11}$ = 69.2 and $X_{59}$ = 21.2

Ratio test symbol $r_{11}$ for the largest extreme applies in
both cases. The critical value for test $r_{11}$ at the 0.05
level is 0.477. In the ratios below, $x_k$ denotes the
observation ranking k-th in magnitude for the test.

Check $X_{11}$ = 69.2:

$$\frac{x_{10} - x_9}{x_{10} - x_2} = \frac{69.2 - 68.8}{69.2 - 68.1} = \frac{0.4}{1.1} = 0.36$$

Since the ratio does not exceed the critical value of 0.477,
accept the hypothesis that $X_{11}$ comes from the same population
as the other results submitted for Test 1.

Check $X_{59}$ = 21.2:

$$\frac{x_{10} - x_9}{x_{10} - x_2} = \frac{21.2 - 20.3}{21.2 - 20.1} = \frac{0.9}{1.1} = 0.82$$

Since the ratio exceeds the critical value of 0.477, reject
the hypothesis that $X_{59}$ comes from the same population as the
other results submitted for Test 5. An asterisk is used to
flag $X_{59}$ as an outlier in the tabulated data.

Compute $\bar{X}_i$ for each of the n tests which have been
pooled for the analysis and use these values as estimates
of the corresponding population means. When computing the
mean for Test 5, exclude $X_{59}$ from the computation to minimize
the probability of distorting the estimated true API
Gravity of sample 63-05. The arithmetic mean estimate of
the true API Gravity for each of the five tests is tabulated
in column $\bar{X}_i$ of Table X.

To avoid the necessity of testing for outliers, it
may be desired to use the Average of the Best Two rather
than the sample arithmetic mean as the estimator of the
population mean. This estimator, discussed in Chapter III,
is relatively easy to compute and has a high efficiency for
small sample sizes. For sample size 10,

$$\bar{X} = \text{Aver. of Best Two} = \tfrac{1}{2}(x_3 + x_8) \tag{4-26}$$

Where: $x_3$ = the $X_{ij}$ ranking third in magnitude among
observations for test i.

$x_8$ = the $X_{ij}$ ranking eighth in magnitude among
observations for test i.

The Average of the Best Two estimate of the true API
Gravity is also tabulated in Table X for comparison with the
arithmetic mean estimate. Both values are identical for four
of the tests and are separated by only 0.1 degree API for
Test 1.

Computing the matrix of deviations from the mean. Subtract
$\bar{X}_i$ from each of the m observations submitted for test i
to obtain the values $v_{ij}$ which measure the algebraic deviation
of each observation from the estimated population mean.

$$v_{ij} = X_{ij} - \bar{X}_i \tag{4-13}$$

The resulting matrix of values for the illustrative tests
is given in Table XI.
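
The steps for one test (outlier check, mean estimate, deviations)
can be sketched as below. The function names and the ten
observations are hypothetical; only the three ranked values that
enter the ratio for Test 5 (21.2, 20.3 and 20.1) and the critical
value 0.477 come from the text. The ratio implemented is the
$r_{11}$ form for a suspected largest value, as used above for
samples of ten.

```python
CRITICAL_R11_N10 = 0.477  # 0.05-level critical value quoted for 10 observations


def dixon_r11_high(values):
    """Dixon ratio r11 for a suspected largest value: (x_n - x_{n-1}) / (x_n - x_2)."""
    xs = sorted(values)
    return (xs[-1] - xs[-2]) / (xs[-1] - xs[1])


def average_of_best_two(values):
    """Equation (4-26): mean of the 3rd and 8th ranked values, sample size 10 only."""
    xs = sorted(values)
    if len(xs) != 10:
        raise ValueError("this form of the estimator assumes 10 observations")
    return 0.5 * (xs[2] + xs[7])


def estimate_mean(values, critical=CRITICAL_R11_N10):
    """Arithmetic mean after dropping the largest value if Dixon's test rejects it."""
    ratio = dixon_r11_high(values)
    kept = list(values)
    if ratio > critical:
        kept.remove(max(kept))
    return sum(kept) / len(kept), ratio


# Hypothetical results for one test; the largest value plays the role of X_59.
test5 = [20.2, 20.1, 20.3, 20.0, 20.2, 20.3, 20.1, 20.2, 21.2, 20.3]
mean5, ratio5 = estimate_mean(test5)
best_two5 = average_of_best_two(test5)

# Equation (4-13): deviation of every observation from the estimated mean.
deviations = [x - mean5 for x in test5]

print(f"r11 = {ratio5:.2f}, mean = {mean5:.2f}, best two = {best_two5:.2f}")
```

Note that the flagged value is excluded from the mean but still
receives a deviation, so it remains visible in the matrix of
$v_{ij}$.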

Testing for homogeneity of variance. Determine the
estimated population variance for each test using an unbiased
estimator:

$$\hat{\sigma}^2 = S^2 \tag{3-17}$$

$$S_i^2 = \frac{1}{m-1}\sum_{j=1}^{m}\left(v_{ij} - \bar{v}_i\right)^2 \tag{3-18}$$

A simpler computational form is:

$$S_i^2 = \frac{1}{m-1}\left[\sum_{j=1}^{m} v_{ij}^2 - \frac{1}{m}\left(\sum_{j=1}^{m} v_{ij}\right)^2\right] \tag{4-27}$$

The values, $S_i^2$, of the estimated population variance for
each of the five illustrative tests are given in Table XII.

Again it may be desired to use a short-cut method of
computation. The Modified Linear Estimator of the population
standard deviation described in Chapter III was
characterized as being relatively easy to compute and
having a high efficiency. For Tests 1 through 4 the
Modified Linear Estimator for sample size 10 is:

$$\hat{\sigma}_i = 0.1968\,(x_{10} + x_9 - x_1 - x_2) \tag{4-28}$$

Where: $x_1$, $x_2$, $x_9$ and $x_{10}$ are the first, second, ninth and
tenth values ranked in order of magnitude from smallest to
largest. Extreme values have a significant effect on
estimates computed from the Modified Linear Estimator.
Observation $X_{59}$ should therefore be excluded from the
computation of the estimated population standard deviation of
Test 5, reducing the sample size to 9. The Modified Linear
Estimator for sample size 9 is:

$$\hat{\sigma}_i = 0.2068\,(x_9 + x_8 - x_1 - x_2) \tag{4-29}$$

Squaring the estimate of population standard deviation
obtained from these computations gives an estimate of the
population variance of each of the five tests. The results
are tabulated in Table XII for comparison with the efficient
estimator computed by equation (3-18). Agreement is reasonably
close except for Test 3. If this estimator is used in
connection with Bartlett's test it is recommended that any
borderline indications of homogeneity or non-homogeneity of
variance be rechecked using the efficient estimator of the
population variance.
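
The two routes to the variance can be compared directly. In this
sketch the function names and the ten deviations are hypothetical;
the coefficients 0.1968 and 0.2068 are those of equations (4-28)
and (4-29), and the computational form is the usual
sum-of-squares identity, assumed here to be what (4-27) expresses.

```python
def unbiased_variance(v):
    """Equation (3-18): sum of squared deviations about the mean over (m - 1)."""
    m = len(v)
    mean = sum(v) / m
    return sum((x - mean) ** 2 for x in v) / (m - 1)


def computational_variance(v):
    """Sum-of-squares form, algebraically identical to (3-18)."""
    m = len(v)
    return (sum(x * x for x in v) - sum(v) ** 2 / m) / (m - 1)


# Coefficients quoted in the text for sample sizes 10 and 9.
MODIFIED_LINEAR_COEFFICIENT = {10: 0.1968, 9: 0.2068}


def modified_linear_sigma(v):
    """Equations (4-28)/(4-29): two largest minus two smallest ranked values."""
    xs = sorted(v)
    coefficient = MODIFIED_LINEAR_COEFFICIENT[len(xs)]
    return coefficient * (xs[-1] + xs[-2] - xs[0] - xs[1])


# Hypothetical deviations v_ij for one test, ten laboratories.
v = [-0.3, -0.2, -0.1, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2, 0.3]

s2_efficient = unbiased_variance(v)
s2_shortcut = modified_linear_sigma(v) ** 2  # square the sigma estimate

print(f"efficient S^2 = {s2_efficient:.4f}")
print(f"short-cut S^2 = {s2_shortcut:.4f}")
```

Because the short-cut estimator uses only the four most extreme
ranked values, a single stray observation moves it more than it
moves the efficient estimator, which is the reason given above for
excluding $X_{59}$ before applying it.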

Compute $S^2$ from (4-22):

$$S^2 = \frac{1.556}{44} = 0.0354$$

Then:

$$f(\ln S^2) = 44\,(-3.34) = -147.00$$

Compute B from (4-24) without evaluating the correction
factor, C:

$$B = \frac{1}{C}\left[-147.00 - (-174.40)\right] = \frac{1}{C}(27.40)$$

Refer to a statistical table of chi-square. Enter the table
with (n-1) = 4 degrees of freedom to determine the critical
value at the ninety five per cent confidence level.

$$\chi^2 = 9.488$$

The value of B exceeds the critical value, indicating that
there is a significant difference among the variances of the
five sets of test data.

Compute correction factor, C, from (4-25):

$$C = 1 + \frac{0.569 - 0.023}{3(5-1)} = 1.045$$

Determine the corrected value of B:

$$B = \frac{27.40}{1.045} = 26.2$$

Since B still exceeds the critical value at the ninety five
per cent confidence level, reject the hypothesis that the five
sets of test results have the same variance and conclude that
they cannot be pooled to form a single large sample.

Form a subset of four tests by dropping the set
exhibiting the most extreme variance, which is Test 1. Test
this subset for homogeneity of variance.

$$f = \sum f_i = 35$$

$$\sum f_i (S_i^2) = 0.503$$

$$\sum f_i (\ln S_i^2) = -153.35$$

$$S^2 = \frac{0.503}{35} = 0.0144$$

$$f(\ln S^2) = (35)(\ln 0.0144) = -147.70$$

$$B = \frac{1}{C}\left[-147.70 - (-153.35)\right] = \frac{1}{C}(5.65)$$

Entering a table of chi-square with (n-1) = 3 degrees
of freedom, determine the critical value at the ninety five
per cent confidence level.

$$\chi^2 = 7.815$$

Since the value of B is less than the critical value, accept
the hypothesis that the four sets of test results have the
same variance and conclude that they are comparable and can
be pooled. The new matrix is given in Table XIII.

Estimating bias. Compute the average algebraic
deviation from the mean, $\bar{v}_j$, for each of the m activities,
excluding outliers from the computation. The $\bar{v}_j$ can then be
used as a point estimate of the magnitude and direction of
the bias in results reported for this type of test by
laboratory j. The reason for excluding the extreme values is
that they were previously rejected on the basis of a hypothesis
test leading to decisions that they probably contained
errors due to mistakes. Inclusion of these mistakes would
distort the bias.

If $A.I._j$ is positive, the laboratory meets the minimum
standard established for accuracy. The larger the value of
$A.I._j$, the higher the degree of accuracy. If $A.I._j$ is
negative, the laboratory does not meet the minimum standard. The
larger the negative value is, the more inaccurate are the
results obtained by the laboratory.

For the illustrative example, n = 4 and the Reproducibility
amount given in the Standard Method of Test for
API Gravity of Petroleum Products, ASTM Designation:
D 287-55 is 0.5.²⁴ Substituting in (4-31):

$$\bar{v}_j' = \frac{0.5}{2\sqrt{4}} = 0.125$$

and, substituting in (4-32):

$$A.I._j = \frac{0.125}{|\bar{v}_j|} - 1.0$$

The $|\bar{v}_j|$ and $A.I._j$ for each of the ten laboratories are
computed and tabulated in Table XIII.
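
The substitution above can be sketched as a small function. The
general forms of (4-31) and (4-32) are not reproduced in this part
of the text, so the expressions below are inferred from the worked
numbers (limit = R.A./(2√n), index = limit/|bias| - 1) and should be
read as an assumption. The function names are hypothetical, and the
bias of 0.25 is chosen only because it yields the index of -0.5
reported for laboratory 9.

```python
import math


def minimum_standard_for_bias(reproducibility, n):
    """Largest acceptable |v_bar_j| for n pooled tests (inferred form of 4-31)."""
    return reproducibility / (2.0 * math.sqrt(n))


def accuracy_index(v_bar, reproducibility, n):
    """Accuracy index (inferred form of 4-32); positive means the standard is met."""
    if v_bar == 0:
        return math.inf  # a zero bias estimate gives an unbounded index
    return minimum_standard_for_bias(reproducibility, n) / abs(v_bar) - 1.0


# R.A. = 0.5 degree API and n = 4 pooled tests, as in the illustrative example.
limit = minimum_standard_for_bias(0.5, 4)
print(f"minimum standard = {limit}")
print(f"A.I. for |bias| 0.25 = {accuracy_index(0.25, 0.5, 4)}")
```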

Of the ten laboratories, only laboratory 9 with an
accuracy index of -0.5 failed to meet the minimum standard
for accuracy in the determination of API Gravity of the four
products. Of the nine laboratories which are above the minimum
standard, laboratories 3 and 4 each with an accuracy
index of +0.2 obtained the least accurate measurements while
laboratory 8 reported measurements equal to the estimated
true API Gravity for all four products.

Analysis of the data for precision. A measure of the
variation in the point estimate of the bias is the population
variance. Use $S_j^2$, computed by substitution in (3-18)
or its easier computational form (4-27) as the estimator of
the variance of measurements made by laboratory j. Include
all the data in the computation because the objective is to
determine how tightly all the observations reported by the
laboratory are clustered. If the objective were to estimate
the precision of the test method (as it would be if the
standard were being tested) extreme values would be excluded,
again pointing out the fact that the proper choice of
statistic or estimator is dependent upon what one is trying
to measure.

Computation of the variance of measurement, $S_j^2$, of
the API Gravity of the four products of the example is
presented in tabular form in Table XIII.

Again using the ASTM Reproducibility amount, R.A.,
as a basis, a minimum standard at the ninety five per cent
confidence level can be established for the relative
precision of test results.

$$\text{Minimum Standard for } S_j^2 = \left(\frac{R.A.}{3}\right)^2 \tag{4-33}$$

A Precision Index, $P.I._j$, can then be computed for each
laboratory as follows:

$$P.I._j = \frac{\text{Minimum Standard for } S_j^2}{S_j^2} - 1.0 \tag{4-34}$$

If $P.I._j$ is positive, the laboratory meets the minimum
standard established for precision. The larger the value of
$P.I._j$, the higher the degree of precision. If $P.I._j$ is
negative, the laboratory does not meet the minimum standard.
The larger the negative value is, the less precise are the
results obtained by the laboratory.

For the illustrative example, substituting in (4-33):

$$\text{Minimum Standard for } S_j^2 = \left(\frac{0.5}{3}\right)^2 = 0.028$$

and, substituting in (4-34):

$$P.I._j = \frac{0.028}{S_j^2} - 1.0$$

Computation of the $P.I._j$ for each of the ten laboratories of
the example is given in Table XIII.

Two of the ten laboratories, laboratory 3 with a
P.I. of -0.4 and laboratory 9 with a P.I. of -0.9, failed to
meet the minimum standards for precision in determination
of the API Gravity of the four products. Measurements
obtained by laboratory 9 were the least precise while those
obtained by laboratory 8 were the most precise.
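
Equations (4-33) and (4-34) translate directly into code. In this
sketch the function names are hypothetical and the variance of
0.28 is an assumed value, chosen because it reproduces the index
of about -0.9 reported for laboratory 9; the actual variances are
in Table XIII.

```python
def minimum_standard_for_variance(reproducibility):
    """Equation (4-33): largest acceptable S_j^2, given the ASTM R.A."""
    return (reproducibility / 3.0) ** 2


def precision_index(variance, reproducibility):
    """Equation (4-34): positive means the laboratory meets the standard."""
    if variance == 0:
        return float("inf")  # perfectly repeatable results give an unbounded index
    return minimum_standard_for_variance(reproducibility) / variance - 1.0


RA_API_GRAVITY = 0.5  # reproducibility amount quoted in the text, degrees API

standard = minimum_standard_for_variance(RA_API_GRAVITY)
index = precision_index(0.28, RA_API_GRAVITY)  # assumed variance, see lead-in

print(f"minimum standard for S_j^2 = {standard:.3f}")
print(f"P.I. = {index:.1f}")
```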

Interpretation  of  Analysis  Results 

Accuracy/mistakes. Relative freedom from mistakes is
determined by the simple inspection of incidence of extreme
values among observations reported by the laboratory. An
excessive number of mistakes indicates possible carelessness.
In a laboratory with more than one operator or more than one
set of equipment, it may reflect a difference in systematic
error among the tests. Since mistakes are due to assignable
causes, the established standard for true mistakes
should be zero. However, since observations are classified
as mistakes on the basis of a statistical decision rule which
carries a risk of making a wrong decision, no stigma should
accompany infrequent occurrences of "mistakes." For example,
a decision rule at the ninety five per cent confidence level
will misclassify one chance error out of twenty as a mistake
in the long run.

Accuracy/systematic errors. Relatively poor accuracy
may be the result of a systematic error or errors. The
estimated bias, $\bar{v}_j$, provides a direct measurement of the
magnitude and direction of a possible systematic error. A
large bias may reflect a local modification to the test
method, either intentional, or accidental by reason of
misinterpretation. It may also indicate a measuring instrument
out of calibration for some reason.

Accuracy/precision. Relatively poor single measurement
accuracy may result from relatively poor precision.
When relatively poor precision is indicated it may be due
to (A) excessive variation in the response of a measuring
instrument, (B) failure to strictly conform with the
prescribed test method, or (C) carelessness producing frequent
minor mistakes in a random pattern.

Application to the illustrative problem. In the
illustrative example, examination of the data indicates a single
gross blunder as the probable cause of the failure of
laboratory 9 to meet the minimum standard for accuracy. There
is no convincing evidence of a significant bias error
affecting measurements and three of the four measurements appear
free of mistakes.

Laboratory 3 meets the minimum standard for accuracy
but not for precision. Poor precision could result in poor
accuracy of any single measurement and the laboratory should
review the test method to insure that it is being strictly
followed.

Laboratory 4 is within limits of both precision and
accuracy but shows an apparent bias. Since bias is due to
assignable causes, the laboratory should attempt to discover
the cause and eliminate it.

LABORATORY  RANKING  INDEX 

Discussion 

An index for indicating the relative reliability of a
laboratory in the performance of a specified test on a given
product or homogeneous group of products was described in
the preceding section. Laboratories can also be rated according
to their relative reliability in performance of the family
of tests associated with a single product. This would be a
useful refinement on the Summary of Laboratory Performance
described in Chapter I, in that it would supply a direct
performance standard for command personnel in evaluating
laboratories under their jurisdiction. To provide the most
efficient indication of operational effectiveness to the
military commander, consideration should be given to the
fact that certain properties of each product have greater
significance in regard to the operational performance of
the product than other properties. This importance can be
recognized by assigning weighting factors to each test.

The measure of relative accuracy common to all tests
is the normal deviate, $z_{ij}$. An appropriate Laboratory Ranking
Index, $LRI_j$, for laboratory j then would be the total of
the weighted $z_{ij}$'s computed for each of the n tests.

$$LRI_j = \sum_{i=1}^{n} w_i z_{ij} \tag{4-35}$$

Where:

$w_i$ = the weighting factor for test i determined by
the relative significance of that test to the
operational performance of the product

And:

$$z_{ij} = \frac{x_{ij} - \hat{\mu}_i}{\hat{\sigma}_i} \tag{4-36}$$

The $w_i$'s are arbitrarily chosen as positive and if
these factors are normalized, i.e. $\sum w_i = 1$, the Laboratory
Ranking Index will have the same units as z and will
represent a weighted average.

Tests which are not adaptable to inclusion, notably
those which require qualitative rather than quantitative
observations such as the test for copper strip corrosion by
petroleum products,²⁸ can be excluded from determination of
the Laboratory Ranking Index by assigning a weighting factor
of zero.

Procedure 

Data  and  assumptions.  The  raw  data  required  are  the 
results  (for  a  sample  of  a  given  product)  of  all  tests,  n 
in  number ,  performed  on  the  product  at  each  of  m  labora- 
tories. The  same  assumptions  made  in  preceding  sections 
of  this  chapter  regarding  homogeneity  of  the  sub-divided 
samples,  normal  frequency  distributions  of  observations, 
and  proven  test  procedures  apply. 

The procedure for determining the ranking index for
each laboratory will be illustrated utilizing correlation
test data reported for sample 54-31 of Ashless Dispersant
Aircraft Lubricating Oil. It is arbitrarily assumed that
only five tests have been assigned a non-zero weighting
factor. These five sets of test results and non-significant
weighting factors assigned for illustrative purposes only
are listed in Table XIV.

Computing the normal deviate. One estimates the true
value of the property for each test. Extreme values
resulting from bias errors or mistakes must be excluded from the
computation. Test suspected outliers by Dixon's ratio test
[equation (3-15) or (3-16)] and use the arithmetic mean as
the estimator of $\mu$. As an alternative, the Average of the
Best Two estimator of $\mu$, taken from Table II, can be used to
facilitate computation.

Suspected extreme values in the illustrative data of
Table XIV were tested by Dixon's method and the observation
0.232 submitted by laboratory 1 for test 5 (Carbon Residue)
was rejected as significant at the ninety five per cent
confidence level. The arithmetic mean estimates of $\mu$ are
shown in the table.

One computes the algebraic deviation, $v_{ij}$, from the
mean of test i and divides by the estimated standard
deviation of the population of laboratory test results,
$\hat{\sigma}_i$, to determine the normal deviate, $z_{ij}$. The
efficient estimator of the standard deviation computed from
equation (3-19) may be used. If it is desired to simplify
computation by the use of one of the less efficient
estimators, the Modified Linear estimator given in Table II
is recommended.

Values of $v_{ij}$, $\hat{\sigma}_i$ and $z_{ij}$ computed for the
illustrative data are shown in Table XIV.

Computing the ranking index. The Laboratory Ranking
Index, $LRI_j$, is computed from equation (4-35). The
laboratory with the smallest LRI is the most accurate in the
overall measurement of the product's properties.
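
The steps of this procedure can be sketched in code. The following is a minimal illustration only, under several assumptions that are not taken from the text: equation (4-35) is not reproduced in this section, so the index is assumed here to be the weighted sum of the absolute normal deviates with the weights normalized to sum to one; the ordinary sample standard deviation stands in for the estimator of equation (3-19); screening of outliers by Dixon's test is omitted; and the function name and the data are hypothetical, not those of Table XIV.

```python
from statistics import mean, stdev


def laboratory_ranking_index(results, weights):
    """Sketch of a ranking index per laboratory (hypothetical helper).

    results[i][j] is the result of test i reported by laboratory j.
    weights[i] is the non-negative weighting factor of test i.
    ASSUMPTION: LRI_j = sum_i w_i * |z_ij| with normalized weights.
    """
    total = sum(weights)
    normalized = [w / total for w in weights]   # weights now sum to 1
    n_labs = len(results[0])
    lri = [0.0] * n_labs
    for w, row in zip(normalized, results):
        if w == 0:
            continue                            # zero weight excludes the test
        mu = mean(row)                          # arithmetic mean estimate of mu_i
        sigma = stdev(row)                      # stand-in for equation (3-19)
        for j, x in enumerate(row):
            z = (x - mu) / sigma                # normal deviate, equation (4-36)
            lri[j] += w * abs(z)
    return lri


# Hypothetical data: two tests, three laboratories.
example = laboratory_ranking_index(
    [[10.0, 10.2, 9.8], [5.0, 5.0, 5.6]], weights=[2, 1]
)
best_laboratory = example.index(min(example))   # index of the smallest LRI
```

Under these assumptions the laboratory whose results lie closest to the test means, in units of the standard deviation, receives the smallest index.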

The LRI's for Aircraft Lubricating Oil computed for
the ten laboratories in the example are shown in Table XIV.
Laboratory 6, with an LRI of 0.346, ranks best among the ten,
while laboratory 1, with an LRI of 1.697, ranks lowest. One
interpretation that can be given to this relationship is that
the probability that laboratory 6 will properly classify oil
on the borderline of acceptability as the result of a single
set of tests is considerably higher than that of
laboratory 1.

CHAPTER V


ANALYSIS BY A GRAPHICAL METHOD


A graphical method for evaluating new laboratory test
procedures has been proposed by Youden.²⁹ This method
utilizes the median as a measure of central tendency. As a
measure of variability, it utilizes an unbiased estimate of
standard deviation based on the mean difference of paired
results. Using this technique as a foundation, a graphical
method for evaluating the relative accuracy and precision of
a group of testing laboratories utilizing specified, proven
test procedures will be developed in this chapter.

Correlation test data will be analyzed by this method
to illustrate the potential usefulness to a military
commander exercising quality surveillance over a group of widely
scattered laboratories.

DISCUSSION 

In the target analogy, the reliability problem was
defined as one of consistently coming as close as possible
to the intersection of the horizontal and vertical
hairlines. Assuming the unattainable situation of absence of
all error, laboratory test results would invariably be the
true value of the property being measured. However, the
existence of various sources of error has been acknowledged.
Consequently, even under the best possible circumstances, the
measurement obtained is expected, with a given degree of
confidence, to be only one of an infinite number of values
within a statistically determinable range.

Assume first that errors do exist but that only
mistakes or systematic errors are possible; none are due to
chance causes. Relating this to the definitions given to
precision and accuracy, the assumption is one of perfect
precision but possibly poor accuracy. The true value of a
property being measured can be represented by either a
horizontal or a vertical centerline. An observed value of the
property can then be represented by a point at a
perpendicular distance from the centerline, which distance measures the
inaccuracy of the observation. Such a representation is
illustrated in Figure 5-1.

FIGURE 5-1

DEVIATION FROM A HORIZONTAL OR VERTICAL AXIS

Assume now that two observations are to be made of the
same property. The first observation is to be plotted on a
horizontal axis and the second is to be plotted on a vertical
axis. If the two axes are overlaid, a graph subdivided into
four quadrants as shown in Figure 5-2 results. The
quadrants have been numbered counterclockwise from I to IV
starting with the upper right-hand quadrant in the conventional
manner. Let the true value of the property being measured
be zero and let the horizontal axis be identified as the
A-axis and the vertical axis as the B-axis. Both axes are
to the same scale. The two observations will be identified
as A and B respectively.

Recalling that chance errors are impossible, if no
mistakes or systematic errors occur both observations will
be the true value, placing data point (A,B) at the
intersection of the two axes. The presence of only a systematic
error will result in data point (A,B) appearing in either
quadrant I if the error causes observations higher than the
true value, zero, or in quadrant III if the error causes
observations lower than the true value, zero. The
appearance of a data point (A,B) in quadrant II or IV results from
one observation being greater than and one observation being
less than the true value. This can be explained only on the
basis of a mistake since systematic errors produce a
constant bias and random errors have been disallowed.

Now discount the possibility of mistakes as well as
random errors. As a consequence, data points can occur only
in quadrant I or III if a systematic error is causing a
positive or negative bias respectively, or at the
intersection of the axes if there is no systematic error. In fact,
since the systematic error has a constant value, the locus
of all possible data points is a straight line passing
through the intersection of the A and B axes and bisecting
quadrants I and III as shown in Figure 5-3.

FIGURE 5-3

THE LOCUS OF EXPECTED VALUES FOR ALL OBSERVATIONS $(A_j, B_j)$
AFFECTED ONLY BY SYSTEMATIC ERRORS

The locus is a straight line through the intersection
of the A and B axes bisecting quadrants I and III.


As the next step, recognition is given to the existence
of chance causes of variation which will cause deviations from
the locus just described. Excluding the possibility of
mistakes, a data point (A,B) is now expected to fall not on the
forty-five degree line through the intersection of the axes
but within an area surrounding a given point on the line.
The maximum amount by which a pair of observations can be
expected to vary a stated percentage of the time solely due
to chance causes can be determined and a circle of
statistical confidence can be constructed around each point on the
line.

The consistent recurrence of scattered paired data
points within such a circle centered on the intersection
of the two axes would indicate highly reliable performance.
The observations would be considered accurate because they
are clustered around the true values of A and B. They are
acceptably precise because they vary only within the limits
of the established performance standard. The consistent
recurrence of paired data points within such a circle of
confidence centered far out on the forty-five degree line
would indicate an acceptable degree of precision but poor
accuracy. The accuracy is considered poor because the
paired observations are centered on a point far removed from
the true values of A and B (Figure 5-4).

Since the forty-five degree line is the locus of an
infinite number of points, the circles of confidence around
them become a confidence band bounded by parallel lines on
each side of the forty-five degree line at a perpendicular
distance equal to the radius of the circle of confidence
(Figure 5-5).

FIGURE 5-4

ZONES OF VARIABILITY ESTABLISHED BY SETTING ARBITRARY
STANDARDS FOR MEASURING ACCEPTABLE PRECISION LIMITS

FIGURE 5-5

DEVELOPMENT OF THE CONFIDENCE BAND FOR PRECISION

As a final consideration, assume the existence of
a large group of laboratories, each having only one
operator and one set of equipment. Also assume once again the
existence of only random errors so that all data points will
cluster about the intersection of the true value axes. Two
variances can then be determined. The repeatability
variance for the test is the random variance between repeat
measurements by the same operator using the same equipment
in the same laboratory. The reproducibility variance for
the test is the variance between measurements obtained at
different laboratories. The reproducibility variance will
normally be larger than the repeatability variance because
of the introduction of additional sources of random
variation.

Setting  Confidence  Limits 

The horizontal deviations from the estimated true
population value, A, and the vertical deviations from the
estimated true population value, B, are independent and
normally distributed and have a common standard deviation
for the population or for any particular laboratory. The
probability that a data point $(A_j, B_j)$ is within b standard
deviations of the point of intersection of the two axes
(A,B) can be determined by integration in polar
coordinates. The expression which results is:

$$
\begin{aligned}
\Pr(b\sigma) &= 1 - \exp\left(-\frac{r^2}{2\sigma^2}\right) && \text{(5-1)} \\
             &= 1 - \exp\left(-\frac{b^2\sigma^2}{2\sigma^2}\right) \\
             &= 1 - \exp\left(-\frac{b^2}{2}\right) && \text{(5-2)}
\end{aligned}
$$

where r = the radial distance to data point $(A_j, B_j)$ = $b\sigma$
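
The integration in polar coordinates that leads to equation (5-1) can be sketched as follows. This is an illustrative derivation, assuming, as stated in the text, that the horizontal and vertical deviations are independent, normally distributed about zero, and share the standard deviation $\sigma$. The symbols x, y, $\rho$ and $\theta$ are introduced here for illustration only and are not part of the original notation.

```latex
\begin{aligned}
\Pr\left(\sqrt{x^2 + y^2} \le r\right)
  &= \iint_{x^2 + y^2 \le r^2} \frac{1}{2\pi\sigma^2}
     \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) dx\,dy \\
  &= \int_0^{2\pi} \int_0^{r} \frac{1}{2\pi\sigma^2}
     \exp\left(-\frac{\rho^2}{2\sigma^2}\right) \rho \, d\rho \, d\theta \\
  &= \left[ -\exp\left(-\frac{\rho^2}{2\sigma^2}\right) \right]_0^{r}
   = 1 - \exp\left(-\frac{r^2}{2\sigma^2}\right)
\end{aligned}
```

Substituting $r = b\sigma$ in the last line gives the form shown in equation (5-2).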
By rearranging terms, an expression is obtained for
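computing the limiting value of b for any desired confidence
level.

$$ \text{Confidence Level, C.L.} = \Pr(b\sigma) = 1 - \exp\left(-\frac{b^2}{2}\right) $$

$$ \exp\left(-\frac{b^2}{2}\right) = 1 - \text{C.L.} \tag{5-3} $$

Taking logarithms of both sides:

$$ -\frac{b^2}{2} = \ln(1 - \text{C.L.}) $$

$$ b = \sqrt{2}\,\sqrt{-\ln(1 - \text{C.L.})} $$

$$ b = 1.414\,\sqrt{-\ln(1 - \text{C.L.})} \tag{5-4} $$

The radius of a circle of confidence around the
intersection of the two means, $r_{C.L.}$, can also be computed.

$$ r_{C.L.} = b\sigma = 1.414\,\sigma\,\sqrt{-\ln(1 - \text{C.L.})} \tag{5-5} $$

The radius for a ninety five per cent confidence
level is:

$$
\begin{aligned}
r_{0.95} &= 1.414\,\sigma\,\sqrt{-\ln(1 - 0.95)} \\
         &= 1.414\,\sigma\,\sqrt{-(-3)} \\
         &= 2.45\,\sigma && \text{(5-6)}
\end{aligned}
$$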
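
Equations (5-4) and (5-5) can be checked numerically. The sketch below is illustrative only; the function names are hypothetical. It uses the exact value of the logarithm, so the multiplier for the ninety five per cent level comes out slightly below the 2.45 obtained above by rounding ln(0.05) to -3.

```python
import math


def sigma_multiplier(confidence_level):
    """b of equation (5-4): b = sqrt(2) * sqrt(-ln(1 - C.L.))."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence level must lie strictly between 0 and 1")
    return math.sqrt(2.0) * math.sqrt(-math.log(1.0 - confidence_level))


def confidence_radius(sigma, confidence_level):
    """r of equation (5-5): r = b * sigma."""
    return sigma_multiplier(confidence_level) * sigma


b_95 = sigma_multiplier(0.95)                  # about 2.45 standard deviations
# Substituting b back into equation (5-2) should recover the confidence level.
recovered = 1.0 - math.exp(-b_95 ** 2 / 2.0)
```

The same function gives the multiplier for any other confidence level, for example ninety nine per cent, should a different standard be desired.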

The ninety five per cent level for the difference
between two observations is $2.77\,\sigma_x$.³¹ Using the
Reproducibility amount as a standard, for single
observations:

$$ \text{R.A.} = 2.77\,\sigma_x $$

$$ \sigma_x = \frac{\text{R.A.}}{2.77} \tag{5-7} $$

For the difference between averages of two pairs of
observations (or between the average of two observations
and the average of the two means):

$$ \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{2}} = \frac{\text{R.A.}}{2.77\,\sqrt{2}} \tag{5-8} $$

Therefore:

$$
\begin{aligned}
r_{0.95} &= \frac{(2.45)(\text{R.A.})}{2.77\,\sqrt{2}} \\
         &= \frac{0.885\,(\text{R.A.})}{1.414} \\
         &= 0.625\,(\text{R.A.}) && \text{(5-9)}
\end{aligned}
$$

In order to estimate the precision of individual
laboratories' test results, a straight line bisecting
quadrants I and III is passed through the intersection of the
two median lines at an angle of forty five degrees to the
axis. Parallel lines can then be constructed on opposite
sides of this forty five degree line to form a ninety five
per cent confidence interval or band. For convenience, the
limits given in ASTM Standards on Petroleum Products and
Lubricants are again used to determine the perpendicular
distance from the forty five degree line to the boundary of
the confidence band. As before, the correction factor of
0.625 must be applied to convert the amount from a range for
a linear normal distribution to a radius for a circular
normal distribution. It may be found to be more convenient
to locate points on the limit line by measuring the
horizontal (or vertical) rather than the perpendicular distance
from the forty five degree line. This distance is
determined by multiplying the radius by the secant of forty five
degrees, 1.414.

The Reproducibility amount rather than the
Repeatability amount was chosen as the basis for determination of
the ninety-five per cent confidence limits in order to have
a minimum standard applicable to all laboratories.
Repeatability amount is the difference which a pair of
results obtained by the same operator using the same
equipment should not exceed. Quite obviously, such precision is
statistically beyond the reach of a large laboratory if
paired results were compiled from different combinations of
equipment and operator. The Reproducibility limits are the
realistic limits in such cases.

PROCEDURE 

Data 

The raw data required are the test results for a given
property obtained from two samples, A and B, of different
batches of product which have each been divided and
distributed among the participating laboratories. Although
desirable, it is not absolutely necessary that both samples
be of the same product. It may be feasible to pair test
results of a sample of motor gasoline with test results of
a sample of aviation gasoline for example. The objective
is to avoid introducing additional sources of variability.
Generally, this objective can be accomplished if the test
procedures are identical and if the two samples are
reasonably close in the magnitude of the property being evaluated.

Assumptions 

Analysis of the data is based upon the following
assumptions: (A) The sub-divided samples are homogeneous,
that is, there is no quality variation of the material
distributed to the various participating laboratories, (B) The
universe of observations for each laboratory and all
laboratories is normally distributed, (C) The test procedure has
been proven, that is, it is adequately described to preclude
general misinterpretation of the exact procedure to be
followed.

Plotting the Data

Select the paired test results to be plotted for a
given property and prepare a graph on rectangular coordinate
paper. Using the same units and the same scale on both axes,
mark an appropriate range on the X axis and Y axis to cover
the range of results submitted for sample A and sample B
respectively. Plot the pairs of results reported by the
laboratories.

Correlation test observations of Vapor Pressure on
sample 63-02 and sample 63-1701 of Combat Automotive Gasoline
will be used to illustrate the procedure. These
observations are tabulated in Table XV as Test A and Test B
respectively. The paired data points are plotted in
Figure 5-6.

Estimating  Central  Tendency 

The estimated true value of the property for sample A
and sample B can be determined graphically using the median
as an estimator. The median is chosen as the estimator
because of the relative ease with which it can be constructed
in comparison with the mean or Average of the Best Two. The
latter estimators both require computation to evaluate the
estimate of the population value. The median can be
determined simply by counting the points. The median of A,
represented by the symbol A, is a vertical line erected
perpendicular to the A axis so that the number of data points
on either side of the line is equal as illustrated in
Figure 5-7. The median of B is represented by the symbol B.

FIGURE 5-6

PLOT OF PAIRED CORRELATION TEST MEASUREMENTS OF VAPOR
PRESSURE OF TWO SAMPLES OF COMBAT AUTOMOTIVE GASOLINE

Setting Confidence Limits

Determine the radius of the ninety five per cent
confidence circle, $r_{0.95}$, by substitution in equations (5-6) or
(5-9). If equation (5-9) is to be used, determine the
Reproducibility amount from the applicable ASTM Standard
Method of Test.

The R.A. will be used as the basis for computing $r_{0.95}$
for this example. From the Standard Method of Test for
Petroleum Products, ASTM Designation: D323-58,³² the R.A.
for automotive gasoline in the 5 to 16 pound vapor pressure
range is 0.3. Substituting in (5-9):

$$ r_{0.95} = 0.625\,(0.30) = 0.188 $$
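
The substitution above, together with the horizontal (or vertical) offset of the band limits described earlier, can be reproduced with a short calculation. This is a sketch only; the function names are hypothetical, and the constants 2.45 and 2.77 are the rounded values used in equations (5-6) and (5-7).

```python
import math


def radius_from_reproducibility(reproducibility_amount):
    """Equation (5-9): r_0.95 = 2.45 * R.A. / (2.77 * sqrt(2))."""
    return 2.45 * reproducibility_amount / (2.77 * math.sqrt(2.0))


def axis_offset(radius):
    """Horizontal (or vertical) distance from the forty five degree line
    to a band limit: the radius times the secant of forty five degrees."""
    return radius / math.cos(math.radians(45.0))


r_95 = radius_from_reproducibility(0.3)   # R.A. of 0.3 used in the example
offset = axis_offset(r_95)                # distance measured along an axis
```

With an R.A. of 0.3 the radius agrees with the 0.188 obtained above to three decimal places; the axis offset is simply the radius multiplied by 1.414.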

Construct the ninety five per cent confidence circle
for accuracy around the intersection of the median lines A
and B, using the radius $r_{0.95}$. With parallel rulers,
construct a forty-five degree line (line passing through the
intersection of the median lines and bisecting quadrants I
and III) and ninety five per cent precision confidence limits
parallel to the forty-five degree line and tangent to the
ninety five per cent circle for accuracy. Figure 5-8
illustrates the completed graphical construction.

FIGURE 5-8

CONFIDENCE LIMITS FOR ACCURACY AND PRECISION
OF DATA PAIRS $(A_j, B_j)$

The circle is the ninety five per cent
confidence limit for accuracy. The parallel
lines tangent to the circle are the ninety
five per cent confidence limits for precision.

INTERPRETING THE PLOT

Plotted results can be interpreted from either of two
viewpoints. The general distribution of data points is of
interest in determining the likelihood of sampling errors.
The location of individual data points is the basis for
laboratory evaluation.

General  Distribution  of  Data  Points 

If the only errors affecting the data were random
errors of precision, positive and negative errors would be
relatively small and would occur with equal likelihood. As
a result, data points should be expected to be tightly
scattered, more or less equally, in all four quadrants formed
by the intersection of the two median lines. This is the
ideal situation, and is unlikely to occur. Individual
laboratory biases will normally cause laboratories to obtain
results on the two samples which are either both negative
or both positive in relation to the median. A concentration
of data points in Quadrant I and Quadrant III can therefore
be expected. The more pronounced this tendency to individual
bias, the greater the departure will be from the ideal
circular distribution.

In the event that the paired observations are nearly
equally divided among the four quadrants, the possibility of
invalid data resulting from a sample distribution error should
be considered.

If the sample divisions distributed to the participating
laboratories are not homogeneous as to the property being
measured, some will yield high results and some will yield
low results. This is true for both samples. The
equiprobable set of paired results is:

(high A, high B; high A, low B; low A, high B; low A, low B).

It follows that a roughly circular scatter of data points
around the intersection of the two medians could be due to
heterogeneous divided samples.
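
The general distribution of data points can be summarized by tallying the points in each quadrant formed by the two median lines. The following is an illustrative sketch only; the function name and the data are hypothetical. The tally merely counts points: whether a near-equal division among the four quadrants points to a sample distribution error remains a matter of judgment as discussed above.

```python
from statistics import median


def quadrant_counts(pairs):
    """Count paired results (a, b) per quadrant about the two medians.

    Quadrants are numbered counterclockwise from the upper right:
    I (+,+), II (-,+), III (-,-), IV (+,-).
    """
    median_a = median(a for a, _ in pairs)
    median_b = median(b for _, b in pairs)
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0, "on a median line": 0}
    for a, b in pairs:
        da, db = a - median_a, b - median_b
        if da == 0 or db == 0:
            counts["on a median line"] += 1   # belongs to no quadrant
        elif da > 0 and db > 0:
            counts["I"] += 1
        elif da < 0 and db > 0:
            counts["II"] += 1
        elif da < 0 and db < 0:
            counts["III"] += 1
        else:
            counts["IV"] += 1
    return counts


# Hypothetical results dominated by individual laboratory bias.
biased = quadrant_counts([(1, 1), (2, 2), (3, 3), (4, 4)])
# Hypothetical results divided equally among the four quadrants.
scattered = quadrant_counts([(1, 4), (4, 1), (2, 2), (3, 3)])
```

In the first hypothetical set all points fall in quadrants I and III, the pattern normally expected; in the second each quadrant holds one point.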

Individual Data Points

Data points within the circle surrounding the
intersection of the two median lines indicate that the laboratory
obtains results for this test which are acceptably accurate,
that is, reasonably free from accidental or systematic
error. Only five per cent of the time will a pair of
observations whose accuracy is affected only by random errors
fall outside this circle. Consequently, a data point
outside the circle is interpreted as an indication of probable
inaccuracy.
Data points within the band surrounding the forty five
degree line indicate that the laboratory obtains acceptably
precise results for this test, that is, the operators are
careful in their work and the results reported are free from
careless errors.

Examination of Figure 5-8 shows that the data used
as an example conform to the general distribution pattern
normally expected with a tendency to cluster in quadrant I
and quadrant III. The dispersion is greater than could be
desired however. The indication is that only four of the
ten laboratories are measuring the vapor pressure of combat
motor gasoline with an acceptable degree of accuracy.
Laboratory 2 seems rather precise and accurate, being on the
forty five degree line and very close to the intersection of
the median lines. The observations reported by laboratory 6
are also highly precise. However the data point appears on
the forty five degree line at a considerable distance from
the intersection of the median lines and well outside the
circle of ninety five per cent confidence for accuracy. It
is noted that both measurements were the highest submitted
among the ten laboratories for each sample. Interpreted in
accordance with the standard for minimum accuracy this
indicates that vapor pressure measurements of combat motor
gasoline by laboratory 6 are inaccurate. The high degree of
precision makes it most probable that the inaccuracy is due
to a systematic error. The command should direct
the laboratory to check possible sources of the error and
take corrective action. The same general conclusions apply
to laboratories 1 and 3. The results reported by
laboratories 4, 5 and 9 are inconclusive. Standing alone, one can
only speculate that most probably a mistake has entered into
one of the measurements of the pair (the measurement of
sample B). In the case of laboratories 4 and 9, the
location of the data points could be due to a mistake entering
into one of the measurements, chance causes normal to the
method (one out of twenty measurements will fall outside the
ninety five per cent confidence limits in the long run), or
poor precision due to modifications of the test method or
due to carelessness. None of these possible causes can be
considered most probable without additional data.
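
The two rules used in this section (a point inside the circle is acceptably accurate; a point inside the band about the forty five degree line is acceptably precise) can be sketched as a simple classification. This is an illustration only; the function name and the example values are hypothetical, and a point lying exactly on a limit is treated here as inside it.

```python
import math


def classify_point(a, b, median_a, median_b, radius):
    """Return (accurate, precise) for one laboratory's paired result."""
    da, db = a - median_a, b - median_b
    # Accuracy: radial distance from the intersection of the median lines.
    accurate = math.hypot(da, db) <= radius
    # Precision: perpendicular distance from the forty five degree line
    # passing through the intersection of the median lines.
    precise = abs(db - da) / math.sqrt(2.0) <= radius
    return accurate, precise


# Hypothetical medians of 9.0 for both samples and a radius of 0.188.
near_center = classify_point(9.1, 9.1, 9.0, 9.0, 0.188)   # accurate, precise
far_on_line = classify_point(9.5, 9.5, 9.0, 9.0, 0.188)   # precise only
off_line = classify_point(9.3, 8.7, 9.0, 9.0, 0.188)      # neither
```

A point that is precise but not accurate corresponds to the situation described for laboratory 6, a probable systematic error. A point that is neither leaves the cause undetermined without additional data.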

ALTERNATE  PLOTTING  METHODS 
Additional analysis of relative performance can be
made by comparison of multiple sets of paired observations
from each laboratory. These observations can be combined
and displayed in various ways. Consider, for example, a
subset of four observations, (A,B,C,D) representing the
results of the same test on four different samples by the
same laboratory. The alphabetical sequence indicates the
chronological sequence in which the tests were performed. The
time interval between tests is one month or more. There are
two logical ways in which this set of observations can be
formed into subsets of paired data. The first way is to
combine the observations in chronological pairs, without
duplication, to form the subset (AB,CD). The other way is
to combine the observations in chronological pairs, with
duplication, to form the subset (AB,BC,CD). The latter
alternative has the advantage that it shows the path and
therefore the trend of the data points more readily by providing
visual continuity from one point to the next.
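
The two ways of forming chronological pairs can be sketched as follows. The function names are hypothetical and the letters stand for the observations themselves.

```python
def pairs_without_duplication(observations):
    """(A, B, C, D) -> [(A, B), (C, D)]; a trailing unpaired value is dropped."""
    return [
        (observations[i], observations[i + 1])
        for i in range(0, len(observations) - 1, 2)
    ]


def pairs_with_duplication(observations):
    """(A, B, C, D) -> [(A, B), (B, C), (C, D)]."""
    return list(zip(observations, observations[1:]))


sequence = ["A", "B", "C", "D"]                  # chronological order
disjoint = pairs_without_duplication(sequence)
overlapping = pairs_with_duplication(sequence)
```

In the second form each observation after the first also begins the next pair, which is what gives the plotted path its continuity.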

The plotting procedure already described provides for
plotting the paired observations from two samples, A and B,
submitted by m activities. It has the time-saving feature
that data are plotted directly as submitted, without
preliminary computation, and a measure of central tendency, the
median, can be determined graphically. Additional pairs of
observations obtained from other samples, such as BC and CD,
must be plotted separately to use this procedure. If it is
desired to plot pairs obtained from more than two samples
on a single graph for direct comparison, some manipulation
is required to align the axes since the median of each sample
will be different. This can be accomplished by overlaying
graphs so that their axes coincide and tracing all data
points onto one graph. Another method is to transfer data
points from one graph to another by measuring their distance
from the axes. A third method is to determine the median
value of the observations submitted for each sample and
code the data by converting the observations to algebraic
deviations from the median. The data points can then be
plotted directly on a prepared graph with intersecting median
lines labeled zero.
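
The third method, coding the data as algebraic deviations from the median, can be sketched as follows. The function names and the values are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import median


def code_as_deviations(values):
    """Convert one sample's observations to deviations from their median."""
    center = median(values)
    return [value - center for value in values]


def coded_pairs(sample_x, sample_y):
    """Pair the coded observations laboratory by laboratory.

    Both lists are assumed to be in the same laboratory order. The
    resulting points can be plotted about median lines labeled zero.
    """
    return list(zip(code_as_deviations(sample_x), code_as_deviations(sample_y)))


# Hypothetical observations from three laboratories on two samples.
points = coded_pairs([1, 2, 3], [10, 20, 60])
```

Because every sample is reduced to deviations from its own median, pairs drawn from different samples share the same origin and can be placed on one graph.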


Correlation test observations of vapor pressure on
sample 54-28 and sample 64-3600 of Combat Automotive
Gasoline are tabulated in Table XV as Test C and Test D in
addition to the two sets Test A and Test B already analyzed
as a pair. The paired data points $(C_j, B_j)$ are plotted in
Figure 5-9 and the paired data points $(C_j, D_j)$ are plotted in
Figure 5-10. All three of the available graphs, Figures 5-8,
5-9 and 5-10, will now be interpreted as a group. By
reference to the interpretation of the graphical analysis of the
paired data set $(A_j, B_j)$, one can see how the availability of
additional data enhances the utility of the method as a
management tool.

The test results reported by laboratory 2 are
accurate and highly precise. A single measurement of the
vapor pressure of an automotive gasoline can be
accepted with a high degree of confidence as a very
close approximation of the true value. No action is required
at the command level.

Test results reported by laboratory 9 show very
poor precision. The pattern of alternating relatively large
positive and negative deviations from the estimated true
vapor pressure of the sample indicates that the poor
precision is due to carelessness or failure
to follow strictly the procedure prescribed for the test.
Depending on the possibility that the tests were performed
by more than one equipment-operator combination (information
which would be available to a military commander) a third
possibility exists. That is the possibility that the
equipment-operator combinations are biased in opposite directions.
This possibility is very easily checked from test records.
The military commander should direct laboratory 9 to check
the precision of its measurements of automotive gasoline by
internal investigation and experiment and initiate the
necessary action to improve the precision.

Vapor pressure measurements made by laboratory 4 are
considered reliable with a high degree of precision and
accuracy. Data point $(A_4, B_4)$ was close to the B axis
although outside the ninety five per cent confidence limits
for precision and accuracy. The indication is that
measurement $A_4$ includes an error due to either a mistake or random
causes with about equal probability. No action is required.

Laboratory  5's  test  results  are  acceptably  accurate 

precise.   Data  poini    _ . 3- )  was  close  to  the  A  3 

c  5   5 

although  outside  the  r  cent  confidence  limits 

for  precision  and  accuracy.   The  indication  is  that  measure- 
ments are  net  quite  as  precise  as  those  of  laboratory 
are  generally  accurate.   Ju<         its  distance  from 
estimated  true  vapor  pressure,  the  error  of  meas     mat  Bz 
was  most  probably  due  to  a     bake  but  could  also  have  been 
due  to  random  causes.   No  action  by  the  military  comman 
is  required. 
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The  behavior  of  the  two  paired  data  points  of  labor- 
atories 7,  8  and  10  is  the  same  as  that  of  laboratories  4 
and  5  but  in  reverse  sequence.   Laboratories  7,  8  and  10 
were  within  the  acceptable  limits  of  precision  and  accuracy 
in  the  earlier  period.   The  latest  paired  set  of  measure- 
ments from  each  is  outside  t  :  lies  close  to 
one  or  the  other  of  the  m        xes. 

Measurement D_6 by laboratory 6 was the same as C_6.
The latest measurement, although acceptable as to accuracy
by the test, is again the highest measurement submitted for
the sample. The indication is that laboratory 6 has not yet
located and corrected the source of its systematic error.
The military commander should underscore this indication to
the laboratory for further attention. The same general
interpretation applies to the data reported by laboratories
1 and 3 except that their bias is in the negative direction.

Figure 5-11 has been prepared to show the trend of
the data submitted by each activity in regard to accuracy
of measurements. The diagrams are constructed to one-half
the scale of Figures 5-8, 5-9 and 5-10, and data are posted
as deviations from the median to make them compatible.
The same interpretations can be derived from this figure as
given above.
FIGURE 5-11

COMBINED PLOT OF DATA PAIRS
(A_j, B_j), (C_j, B_j) AND (C_j, D_j)
BY LABORATORY


CHAPTER  VI 
SUMMARY  AND  CONCLUSIONS 

Summary 

In this thesis, the author has investigated some
statistical means of obtaining more definitive information
concerning the reliability of military petroleum testing
laboratories than is currently obtained from existing
correlation testing programs. Numerical methods of analyzing
single observations, paired observations and multiple
observations, and a graphical method of analysis were
discussed. Procedures were described for analyzing and
interpreting the data by each method and were applied to actual
military correlation test data.

Table  XVI  summarizes  the  tests  which  can  be  applied 
to  each  activity. 

It  was  found  that  for  a  single  observation  one  could 
test  the  hypothesis  at  any  pre-selected  confidence  level 
that  the  single  observation  is  statistically  the  same  as  the 
true  value  of  the  property  being  measured.   Since  there  is 
no  dispersion  to  a  single  measurement  it  cannot  be  tested 
for  precision.   Therefore  no  further  amplification  can  be 
made  of  a  decision  that  a  single  observation  is  statistically 
inaccurate  at  the  selected  confidence  level.   This  method, 
using  a  ninety  five  per  cent  confidence  level  represented 
TABLE XVI

SUMMARY OF TESTS OF LABORATORY MEASUREMENTS

Tests based on the ASTM Reproducibility amount (R.A.)
provide confidence at the ninety five per cent level.


Tests of Single Observations

Hypothesis test for accuracy:

\[ \hat{\mu} - \frac{R.A.}{2} \le X_{0.95} \le \hat{\mu} + \frac{R.A.}{2} \tag{4-6} \]


Tests of Paired Observations

Hypothesis test for precision:

\[ |v_{1j} - v_{2j}| \le R.A. \tag{4-16} \]

Hypothesis test for accuracy (if precision hypothesis
is accepted):

\[ -\frac{R.A.}{2\sqrt{2}} \le \bar{v}_j \le +\frac{R.A.}{2\sqrt{2}} \tag{4-20} \]

Estimate of bias (if precision hypothesis is accepted):

\[ \bar{v}_j = \frac{v_{1j} + v_{2j}}{2} \tag{4-17} \]


Tests of Multiple Observations

Precision index:

\[ P.I._j = \frac{\text{Minimum Standard for } S}{S_j} - 1.0 \tag{4-34} \]


Minimum  Standard  for  S 


R.A, 


(4-33) 


Accuracy index:

\[ A.I._j = \frac{\text{Minimum Standard for } |\bar{v}_j|}{|\bar{v}_j|} - 1.0 \tag{4-32} \]

where |\bar{v}_j| is computed including all v_{ij}.

\[ \text{Minimum Standard for } |\bar{v}_j| = \frac{R.A.}{2\sqrt{n}} \tag{4-31} \]

Estimate of bias:

Bias estimate = \bar{v}_j excluding extreme values

Laboratory Ranking Index

\[ LRI_j = \sum_{i=1}^{n} w_i z_{ij} \tag{4-35} \]

w_i = the weighting factor for test i determined by
the relative significance of that test to the
operational performance of the product.

\[ z_{ij} = \frac{X_{ij} - \hat{\mu}_i}{\sigma_i} \tag{4-36} \]


Graphical Analysis

Radius of circle of confidence for accuracy or
precision:

\[ r_{0.95} = 0.525 \, (R.A.) \tag{5-9} \]
by the ASTM Reproducibility amount, is the current method of
evaluating correlation test data.

When two homogeneous sets of single observations were
pooled and analyzed as pairs of data, the hypothesis that
the two single observations of each pair came from the same
population could be tested at the ninety five per cent
confidence level, thereby measuring the relative precision of
the two observations. If the precision hypothesis was
accepted, the hypothesis that the average of the two paired
observations came from the same population as the estimated
μ could be tested to determine the accuracy of the
measurements. Again on the prior condition that the precision
hypothesis was accepted, the average bias error of the two
observations could be determined as an indication of a
systematic error due to assignable causes.
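
The paired-observation procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `paired_test` and its arguments are hypothetical, each deviation is taken to be an observation minus the estimated true value for its sample, and the limits R.A. for precision and R.A./(2*sqrt(2)) for accuracy are the reading of equations (4-16) and (4-20) assumed here.

```python
import math


def paired_test(v1, v2, ra):
    """Evaluate one laboratory's pair of deviations (hypothetical helper).

    v1, v2: deviations of the two observations from the estimated true
            values of their samples (assumed definition).
    ra:     ASTM Reproducibility amount for the test method.
    """
    # Precision hypothesis: the two deviations differ by no more than R.A.
    precise = abs(v1 - v2) <= ra
    if not precise:
        # Accuracy and bias are only interpreted when precision is accepted.
        return {"precise": False, "accurate": None, "bias": None}

    mean_dev = (v1 + v2) / 2.0                 # average deviation of the pair
    limit = ra / (2.0 * math.sqrt(2.0))        # assumed accuracy limit
    return {
        "precise": True,
        "accurate": abs(mean_dev) <= limit,
        "bias": mean_dev,                      # estimate of systematic error
    }
```

Returning `None` for accuracy and bias when the precision hypothesis is rejected mirrors the prior condition stated above: neither quantity is interpreted for an imprecise pair.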

When several homogeneous sets of single observations
were pooled and analyzed as a group, it was found that
precision, accuracy and bias could be measured independently,
that is, the validity of the test of one quality of the
measurements had no dependence on the prior outcome of
another. Precision and accuracy were each measured by an
easily interpreted index computed by comparison to
established minimum standards. The sign and the magnitude of the
index indicate the relative goodness or poorness of accuracy
or precision.
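
A minimal sketch of the two indices follows. It assumes each index has the form (minimum standard / observed value) - 1.0, so that a positive index means better than the standard and a negative index means worse; that the minimum standard for the mean deviation is R.A./(2*sqrt(n)); and that the precision index compares standard deviations. The minimum standard for S is passed in as an argument rather than computed, and the function names are hypothetical.

```python
import math
import statistics


def accuracy_index(deviations, ra):
    """Accuracy index from all deviations of one laboratory (assumed form)."""
    n = len(deviations)
    standard = ra / (2.0 * math.sqrt(n))       # assumed minimum standard
    mean_abs = abs(statistics.mean(deviations))
    if mean_abs == 0.0:
        return math.inf                        # no measurable mean deviation
    return standard / mean_abs - 1.0


def precision_index(deviations, s_standard):
    """Precision index against a supplied minimum standard for S."""
    s = statistics.stdev(deviations)           # sample standard deviation
    if s == 0.0:
        return math.inf                        # no measurable dispersion
    return s_standard / s - 1.0
```

An index of zero means the laboratory sits exactly at the minimum standard, which is what makes the sign easy to read at a glance.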
A graphical method of analysis was developed which
requires only one simple multiplication for its
initial application and no mathematical calculations
thereafter. The data are analyzed in pairs, requiring a minimum
of two sets of single observations. When the analysis was
limited to two single observations, the same limitations
were encountered in interpreting a pair of observations
that was not adequately precise as were encountered with
the numerical analysis of paired observations. Increasing
the number of sets of single observations included in the
analysis permitted more specific interpretation. When
utilizing the graphical method of
analysis, the homogeneity of data sets could be verified by
observing the general pattern formed by the plotted data.
A separate statistical test of homogeneity of variance was
required when using the numerical method.
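
The single multiplication and the subsequent reading of the plot can be sketched as follows. This is an illustrative sketch, not the procedure of Chapter V itself: the function names are hypothetical, the multiplier from equation (5-9) is passed in as `factor` rather than hard-coded, and each plotted point is assumed to be a pair of deviations from the median for one laboratory.

```python
import math


def confidence_radius(ra, factor):
    """Radius of the circle of confidence: the one multiplication needed."""
    return factor * ra


def classify_point(dev_first, dev_second, radius):
    """Report whether one plotted pair of deviations falls inside the circle.

    dev_first, dev_second: deviations of a laboratory's two measurements
                           from the median (assumed plotting convention).
    """
    distance = math.hypot(dev_first, dev_second)   # distance from the origin
    return "inside" if distance <= radius else "outside"
```

Once the radius is drawn on the chart, the same classification is made by eye, which is why no further calculation is needed after the initial application.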

Analysis by the graphical method was used to
illustrate how the pooling of homogeneous test data sets
increased the effectiveness of analysis of correlation test
results as a management tool of the military commander.

In the final analysis, the benefits of reliability
in performance of specific tests for specific properties
lie in correctly classifying a product as to suitability for
use. A method of rating laboratories according to their
relative reliability in performance of the family of tests
associated with a single product was therefore developed as
a useful improvement on the Summary of Laboratory Performance.
The method provides for the computation of a Laboratory
Ranking Index which is a composite of the relative accuracy
of measurement of the various properties of the product,
weighted in accordance with their significance in regard to
the operational performance of the product.
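
As a hedged illustration, the Laboratory Ranking Index for one laboratory can be computed as a weighted sum of standardized deviations, following the reading of equations (4-35) and (4-36) assumed here. The function name, argument names and example figures in the docstring are hypothetical; the weights in particular would have to be set by the user from the operational significance of each test.

```python
def laboratory_ranking_index(observations, true_values, sigmas, weights):
    """Weighted sum of standardized deviations for one laboratory.

    observations: the laboratory's result for each test of the product
    true_values:  estimated true value for each test
    sigmas:       standard deviation assumed for each test (must be nonzero)
    weights:      weighting factor for each test (an assumption of the user)
    """
    lengths = {len(observations), len(true_values), len(sigmas), len(weights)}
    if len(lengths) != 1:
        raise ValueError("all inputs must list the same tests")

    total = 0.0
    for x, mu, sigma, w in zip(observations, true_values, sigmas, weights):
        z = (x - mu) / sigma        # standardized deviation for this test
        total += w * z              # weight by operational significance
    return total
```

Because the deviations are signed in this sketch, errors in opposite directions on different tests can offset one another; summing absolute values instead would be a different design choice and is not implied by the text.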

Conclusions 

The current method of analyzing correlation test data
is statistically too primitive to provide the military
commander with adequate intelligence concerning the
effectiveness of the petroleum testing laboratories within his area
of jurisdiction.

Maintaining a high degree of accuracy among the
petroleum testing laboratories is the specific goal of a military
correlation testing program. But accuracy is a function of
precision and bias. By analyzing the accuracy of a
laboratory's work in terms of precision and bias the correlation
testing program can be made into a more effective
management-by-exception tool. This requires, as a minimum, analysis of
paired homogeneous data sets or, preferably, analysis of
multiple homogeneous data sets.

Further investigation of the requirements of an
effective correlation testing program is strongly recommended.
This thesis was limited to investigation of some statistical
methods of evaluating the reliability of results of
laboratory tests of petroleum products, and better methods of
evaluation were found. Many other facets remain to be
explored before a complete program can be formulated and
recommended for implementation. Evaluation of the optimum
frequency of tests, evaluation of the significance of each
test, investigation of the validity of using the ASTM
Reproducibility amount as a standard, and investigation of
the relationship between correlation test measurements
and routine test observations are a few.